Abstract

VLSI design in general, and microprocessor design in particular, has been treated more like an art
than a science in the past. The goal of this thesis is to explain the science of VLSI design to
someone who wants to build a microprocessor. This can be accomplished by providing a quantitative
way to evaluate, and a systematic approach to design, a microprocessor. Resources and complexity
are two separate ways a microprocessor designer can pay for performance. The microprocessor
designer must evaluate the performance, resources, and complexity tradeoffs quantitatively. In
this thesis, the SPUR (Symbolic Processing Using RISC Machines) CPU microarchitecture is used as
an example to show how performance, resources, and complexity tradeoffs can be evaluated
quantitatively. A systematic approach to microarchitectural design is then developed based on the
SPUR CPU design experience. The SPUR CPU is implemented in 1.6µm, double layer metal, CMOS
technology. It consists of 115,000 transistors, runs with a 100 ns cycle time, and consumes 0.8W
of power. Important features of the SPUR CPU are: an internal instruction cache; a four-stage
pipeline; support for LISP; a cache controller interface for multiprocessing and virtual memory
support; and a parallel coprocessor interface for floating point arithmetic support. All these
features make the SPUR CPU significantly different from, and more complex than, previous
generations of Berkeley RISC machines.
Chapter 1

INTRODUCTION 


What  is  hard  to  get  across  is  the  tremendous  speed  at  which 
things  are  changing  in  this  business. 

Dave  Patterson,  New  York  Times,  1988 


1.1.  The  Berkeley  Tradition 

The history of VLSI chip projects at Berkeley is shown in Table 1-1-1. This table also
illustrates the evolution of VLSI projects in the research environment, because what happened at
Berkeley was very typical. In the early 1980s, the Mead and Conway design style enabled us to
build 40,000+ transistor (large for the time) NMOS VLSI microprocessors such as RISC I
[Pat82], RISC II [Kat83], and SOAR [Ung84]. Fueled by these successes and further advances in
technology, we started the SPUR project in 1985 after we completed the SOAR project. The
SPUR project’s ambitious goal is to build a multiprocessor workstation system [Hil86].

SPUR stands for Symbolic Processing Using RISC machines. Figure 1-1-1(a) is a block
diagram of the SPUR multiprocessor. SPUR is a shared-bus multiprocessor that consists of 6 to 12
identical high-performance processors. These processors are connected to each other, to standard
shared memory, and to input/output devices with a modified TI Nu-Bus, which we call the
SPUR Bus [Gib87]. Figure 1-1-1(b) is an expanded view of the SPUR processor board. Each
SPUR processor contains a 128K-byte cache to reduce the bandwidth required from the bus and
the shared memory. Each SPUR processor is implemented on a single board with about 200
standard chips and three custom CMOS chips: the Cache Controller (CC), the Floating Point Unit
(FPU), and the Central Processing Unit (CPU).

Year  Project         Cycle Time  Transistor  Technology  Pin    Design Effort  Generation
                      (ns)        Count                   Count  (man-years)
1981  RISC I          800         44K         4-NMOS      62     2.3            1st
1983  RISC II Chip A  500         41K         4-NMOS      62     2.8
      RISC II Chip B  330                     3-NMOS
1985  SOAR            400         36K         4-NMOS      84     3.3            2nd
1988  SPUR CPU        100         120K        1.6-CMOS    208    4.5            3rd
      SPUR CC                     68K
      SPUR FPU                    100K

Table 1-1-1 The Berkeley Tradition

The Cache Controller handles cache accesses, performs address translation [Woo86],
accesses shared memory over the shared bus, and maintains cache consistency [KEW85]. The
Floating Point Unit [BPT87] supports the IEEE standard for binary floating-point arithmetic.
Finally, the CPU is based on the Berkeley RISC architecture. The SPUR CPU, however, is
different from RISC II because it has a 512-byte internal instruction cache, a longer pipeline, a
coprocessor interface, and support for LISP. These three custom VLSI chips are implemented in a
1.6µm double layer metal CMOS technology and each consists of approximately 100,000
transistors.

1.2.  Research  Motivation 

The research reported in this thesis is motivated by the design of the SPUR CPU.
Microprocessor design is influenced by many different issues, and their effects were studied by
Katevenis in 1983 [Kat83]. Since then, however, many old design issues have changed and many new
design issues have emerged due to influences from four areas:
(a) SPUR Workstation Basic Architecture (b) SPUR Processor Board


Figure  1-1-1  The  SPUR  Workstation 


The SPUR multiprocessor workstation is a shared-bus multiprocessor that consists of 6 to 12
identical high-performance custom processors. Each processor contains three custom VLSI
CMOS chips: the CPU, the CC, and the FPU. These three chips are connected by a 38-bit address
bus and a 64-bit data bus, and the CPU and FPU are also connected by a parallel coprocessor
interface. The CPU only uses the lower 32 bits of the address bus and the lower 40 bits of the
data bus.


System 

There is demand for more support for coprocessors, memory management, multiple
processors, and operating systems.

Software 

There  is  demand  for  more  support  for  specialized  languages.  Better  compiler  technology  is 
also  available  for  better  hardware-software  trade-offs. 

Simulation 

The higher demands in the system and software areas increase the popularity of multiple
chip projects such as Berkeley’s SPUR, Xerox’s Dragon [MoS85], DEC’s Firefly
[TSJ88], and HP’s Spectrum project [BiW86]. A project spanning multiple chip designs
requires significantly more detailed behavioral simulation to resolve communication and
interaction problems among the chips. The need for detailed simulation is especially acute
for exceptional conditions such as interrupts and traps.
Technology

CMOS, with higher speed and lower power consumption, is replacing NMOS. As devices
scale to smaller geometries and as the chip area increases, many electrical problems, such as
inductance, can no longer be ignored. Furthermore, as more functions can be placed on-chip,
on-chip interactions become more complex while off-chip communication remains a major
bottleneck.

We believe the problems we faced in designing the SPUR CPU were a preview of what the
rest of the research community will need to confront in the near future. There are two terms that I
will use often in this thesis. Before going any further, let me clarify my definitions to avoid
confusion later.

Macroarchitecture 

The term macroarchitecture can be defined as the machine language programmer’s view of
the processor, generally found in the machine language programmer’s manual. For a
microprocessor, however, a machine language programmer’s manual really does not tell the
whole story. The macroarchitecture of a microprocessor should also include an interface
specification for the board designer.

Microarchitecture 

The term microarchitecture will be defined formally in Chapter 5. In the meantime, it is
defined informally as the specification of how the macroarchitecture is implemented in a
given technology. The microarchitecture may have some impact on the macroarchitecture.
This feedback path is one of the main tenets of the original RISC argument.

1.3.  Contemporary  RISC  Processors 

This section looks at several RISC processors that were introduced at approximately the
same time as the SPUR CPU was being built. I have selected two research projects and three
commercial projects. In my opinion, each of the selected processors has a significant
feature or features that earn it a place in the short but brilliant history of
RISC processors. The two research processors I selected are MIPS-X and CRISP:

MIPS-X 

MIPS-X [ChH87] [Hor87b] [Hor87a] was designed at Stanford University. It was
implemented in 2µm, double-layer metal, CMOS technology. It contains 150K transistors in an
8mm x 8.5mm die and has 84 signal pins and 24 power pins. The peak operating frequency
is 20MHz and the chip dissipates less than 1W. MIPS-X has a 32-word register file, a
512-word direct-mapped (32 blocks of 16 words) on-chip instruction cache, and uses a five-stage
pipeline. The five stages are: (1) Instruction Fetch, (2) Register Read, (3) Execute, (4)
Memory Access, and (5) Write Back of registers. The execution unit contains a 64-bit to
32-bit funnel shifter, a 32-bit ALU, and a special register MD that is used by the
multiplication-step and division-step instructions. Branches are delayed for two cycles. In
order to help the compiler fill these two delay slots, MIPS-X has the option to change
these delay instructions into NOOPs on the fly ("squash" the instructions). The MIPS-X
coprocessor interface treats coprocessor instructions as a form of memory operation and
uses the address lines to transmit the coprocessor instructions to the coprocessor(s). The
most significant feature of MIPS-X is its fast cycle time. Unlike the SPUR designers, who
took the conservative approach to increase the chance of getting a reliable CPU, the
MIPS-X circuit designers used very aggressive circuit designs to lower the cycle time.

CRISP 

CRISP [BDM87] [DiM87] [Ber87] was designed at AT&T Bell Laboratories. It was
implemented in 1.75µm, single-layer metal, double-layer polysilicon, CMOS technology. It
contains 170K transistors in a 10.35mm x 12.23mm die and has 95 signal pins, 20 power pins,
19 ground pins, and 6 test pins. The peak operating frequency is 16MHz and the chip
dissipates 500mW. CRISP can be divided into six functional blocks: (1) Input/Output, (2)
Prefetch Buffer, (3) Prefetch and Decode Unit, (4) Decoded Instruction Cache, (5) Execution
Unit, and (6) Stack Cache. The prefetch buffer is a 512-byte direct-mapped cache organized
into 32 blocks. The decoded instruction cache is a direct-mapped cache with 32 192-bit
entries. Each entry is a fully decoded instruction. The Execution Unit uses a three-stage
pipeline: (1) Operand fetch, (2) ALU operation, and (3) register writeback. The stack cache is
implemented with two 32-word byte-addressable register files. Branches can be "folded"
into other instructions and executed implicitly as part of those instructions. The most
significant features of CRISP are its architectural innovations: the stack cache and branch
folding. According to the CRISP designers, stack cache access is as fast as register access but
has the advantage of software transparency. Branch folding enables CRISP to execute branches in
parallel with other useful instructions. This makes CRISP the first microprocessor that can
execute multiple instructions per cycle.

The three commercial processors I selected are R3000, SPARC SF9010IU, and MC88000.
Information concerning the detailed microarchitecture of the commercial processors is not as
readily available as it is for the research processors. However, there is still enough information
for me to judge why these commercial processors deserve special attention.

R3000 

R3000 [Kan88] was designed at MIPS Computer. It is implemented in 1.2µm CMOS
technology and resides in a 172-pin PGA. The peak operating frequency is 25MHz. R3000 has
a 32-word register file. There is no on-chip instruction or data cache. Integer
multiplication is supported by hardware but integer division is supported by software only.
Branches are delayed for one cycle. The most significant feature of R3000 is its speed: the
25MHz clock rate probably made the R3000 the fastest RISC CPU when it was introduced.

SPARC  SF9010IU 

SPARC SF9010IU [NaA88] is the first implementation of Sun Microsystems’ Scalable
Processor Architecture (SPARC) [Gar88]. It was implemented in Fujitsu’s high speed 20K
gate, 1.5µm, 256-pin (156 of them are signal pins) gate array. The peak operating
frequency is 16.67MHz. SF9010IU has a 120-word register file organized into eight global
registers and seven overlapped windows, a dual-instruction buffer, and uses a four-stage
pipeline. The four stages are: (1) Instruction Fetch, (2) Decode, (3) Execute, and (4) Write.
An integer multiplication step is supported by hardware but integer division is supported by
software only. Branches are delayed for one cycle and the delay instruction can be
"squashed" depending on the branch outcome. The most significant feature of SPARC
SF9010IU is its simplicity. It is so simple that it was implemented in a single gate array in a
relatively short period of time.

MC88000 

MC88000 [DRN88] was designed at Motorola. It was implemented in 1.5µm CMOS
technology and resides in a 181-pin PGA. The peak operating frequency is 20MHz. MC88000
has a 32-word register file but there is no on-chip instruction or data cache. Integer
multiplication, integer division, and floating point arithmetic are all supported by
hardware. The most significant feature of MC88000 is that it follows a supercomputer
model that employs a scoreboard similar to the CDC 7600. The centerpiece of the
architecture is a set of multiple pipelined functional units that can execute independently and
concurrently. The usage of these functional units is controlled via scoreboarding. MC88000 is
one of the few RISC processors that use on-chip resources for floating point hardware
instead of for an instruction cache.

1.4.  Research  Goal  and  Thesis  Organization 

The goal of this thesis is not to formulate the theory of VLSI design but to explain the
science of VLSI design to someone who wants to build a microprocessor. This can be accomplished
by providing a quantitative way to evaluate and a systematic way to design a microarchitecture.
Since this research is based on the SPUR CPU design experience, the SPUR CPU
microarchitecture and the lessons I learned must be introduced first. This thesis is organized
as follows:

Chapter 2 Describes the SPUR CPU microarchitecture.

Chapter 3 Discusses the lessons I learned in designing the SPUR CPU.

Chapter 4 Develops a quantitative way to evaluate a microprocessor’s microarchitecture.
Different features of the SPUR CPU microarchitecture are then evaluated as examples.

Chapter 5 Develops a systematic approach to designing a microprocessor’s microarchitecture. I
illustrate this approach by using it to recreate the SPUR CPU microarchitecture.

Chapter 6 Summarizes the thesis. I also say a few words about what I think the future will be
like based on my experience in SPUR.
Figure 1-4-1 The First Generation Berkeley RISC Machine: RISC I

Figure 1-4-2 The Second Generation Berkeley RISC Machine: SOAR

Figure 1-4-3 The Third Generation Berkeley RISC Machine: SPUR CPU
Chapter 2


THE  SPUR  CPU 
MICROARCHITECTURE 


If  things  are  too  complex,  I  can’t  understand  them. 

Seymour  Cray,  1976 


This chapter describes the microarchitecture of the SPUR CPU. Section 2.1 gives an
overview and covers all the important features. The SPUR CPU can be divided into two units: the
Instruction Unit and the Execution Unit. Section 2.2 describes the Instruction Unit and Section
2.3 describes the Execution Unit. Finally, Section 2.4 describes the controller that controls the
SPUR CPU. The following naming conventions are used in describing the microarchitecture:

• Register names start with an upper case letter and the rest are lower case except to improve
readability. Examples are Dst1 and IfetPC.

• Functional block names are in upper case letters only. Examples are ALU and EXT_INS.

• Signal names start with a lower case letter and the rest are lower case except to improve
readability. Examples are busA and trapType.

The goal of this chapter is to give an overall picture of the SPUR CPU microarchitecture, so
I can use the SPUR CPU as an example in later chapters. Please refer to Appendix A for a
detailed discussion of the microarchitecture.

2.1.  The  SPUR  CPU  Microprocessor 

The SPUR CPU is a third generation Berkeley RISC microprocessor, and this section
describes the important features of the SPUR CPU microarchitecture.
2.1.1. Overview of the SPUR CPU

The SPUR CPU is similar to RISC II in that it has a reduced instruction set and a 138-register
register file organized into 10 global registers and eight overlapped register windows (see
Appendix A). However, unlike RISC II, the SPUR CPU also has a 512-byte on-chip instruction cache, a
four-stage pipeline, a cache controller interface, and a parallel coprocessor interface. Internally,
the SPUR CPU uses a combination of a byte extractor/insertor and a shifter instead of the more
complicated barrel shifter. It also uses an extra adder to calculate the branch address to support
the 1-cycle Compare-Branch instructions. Finally, eight extra tag bits are attached to each 32-bit
register to support LISP. This makes each SPUR CPU register 40 bits wide (see Figure 2-1-2). The
SPUR CPU was fabricated in 1.6µm, double layer metal, CMOS technology. The die size is
1.15cm x 1.15cm and the chip is packaged in a 208-pin pin grid array.


Register-Register: Rd, Rs1, Rs2
| opcode | Rd | Rs1 | 0 | Rs2 | unused |
31       24   19    14        8        0

Register-Register: Rd, Rs1, Immediate
| opcode | Rd | Rs1 | 1 | Immediate |
31       24   19    14              0

Store: Rs2, Rs1, Immediate
| opcode | High Imm | Rs1 | 1 | Rs2 | Low Imm |
31       24         19    14        8         0

Compare-Branch: Rs1, Rs2
| opcode | Cond | Rs1 | 0 | Rs2 | Branch Offset |
31       24     19    14        8               0

Compare-Branch: Rs1, Short Imm
| opcode | Cond | Rs1 | 1 | Short Imm | Branch Offset |
31       24     19    14              8               0

Compare-Branch: Rs1, Tag Imm
| opcode | Cond | Rs1 | Tag Imm | Branch Offset |
31       24     19    14        8               0

Call, Jump: Word Address
| opcode | Word address within current segment |
31       27                                    0


Figure 2-1-1 SPUR Instruction Formats

Register-Register instructions use the Rs1-Rs2 or Rs1-Immediate pair to specify the source
operands and the result is stored into the register specified by Rd. For Store instructions, Rs2
contains the value to be stored and the effective address is formed by adding Rs1 to the
concatenation of the High Imm and Low Imm fields. Compare-Branch instructions’ formats are
selected by the Cond field. The three formats are: (1) Rs1-Rs2 format: compare the two registers’
contents, the two registers’ type-tags, or both contents and tags; (2) Short Imm format: compare
the zero extension of the Short Imm field with Rs1’s contents; (3) Tag Imm format: compare the
6-bit Tag Imm with Rs1’s type-tag.

The SPUR CPU is a register-to-register machine in which load and store are the only types
of instructions that access memory. The effective address of load and store instructions can either
be the sum of two registers or the sum of one register and an immediate constant. The SPUR
memory system does not support byte addressing and the two least significant bits of the 32-bit
address are always ignored by the memory system.
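
The addressing rule above can be illustrated with a minimal Python sketch. The function names are illustrative only, and the 32-bit wraparound on the addition is an assumption that the text does not state.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit address space


def effective_address(rs1: int, operand: int) -> int:
    """Sum of a register and a second register or an immediate constant.

    Wrapping at 32 bits is an assumption of this sketch.
    """
    return (rs1 + operand) & WORD_MASK


def memory_address(ea: int) -> int:
    """Address as seen by the memory system: the two low bits are ignored."""
    return ea & ~0x3 & WORD_MASK


ea = effective_address(0x1000, 0x7)
print(hex(ea), hex(memory_address(ea)))
```

Any effective address inside a word therefore refers to the same word-aligned location.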

The SPUR CPU modes of operation can be divided into two orthogonal sets: (1) User vs.
Kernel and (2) Virtual vs. Physical. Only when the SPUR CPU is in kernel mode can privileged
instructions be executed. In virtual mode, data and instruction addresses generated by the SPUR
CPU are interpreted by the SPUR memory system as virtual addresses. In physical mode,
addresses are interpreted as physical addresses, which is useful in debugging and bootstrapping
the system. The mode of operation is controlled by writing different bit patterns into the Kernel
Processor Status Word, Kpsw (see Appendix A).

2.1.2.  Instruction  Formats 

The SPUR CPU instruction set [Tay85] (see Appendix A-3) can be grouped into four generic
instruction types: Register-Register, Store, Compare-Branch, and Call-Jump. The formats of
these generic types are shown in Figure 2-1-1.
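
As an illustration of the two Register-Register formats in Figure 2-1-1, the following Python sketch extracts the fields from a 32-bit instruction word. The exact bit boundaries (a 7-bit opcode in bits 31-25, 5-bit register fields, the format flag in bit 14, and a 14-bit immediate) are assumptions read off the bit labels in the figure, the immediate is returned raw without sign extension, and the opcode value 0x12 is made up for the example. Appendix A-3 remains the authoritative definition.

```python
from typing import NamedTuple, Optional


class RegReg(NamedTuple):
    opcode: int
    rd: int
    rs1: int
    rs2: Optional[int]        # set when the flag bit is 0
    immediate: Optional[int]  # set when the flag bit is 1 (raw field)


def bits(word: int, hi: int, lo: int) -> int:
    """Extract word[hi:lo], inclusive on both ends."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_reg_reg(word: int) -> RegReg:
    opcode = bits(word, 31, 25)   # assumed 7-bit opcode
    rd = bits(word, 24, 20)       # assumed 5-bit destination register
    rs1 = bits(word, 19, 15)      # assumed 5-bit source register
    if bits(word, 14, 14):        # flag = 1: second operand is an immediate
        return RegReg(opcode, rd, rs1, None, bits(word, 13, 0))
    return RegReg(opcode, rd, rs1, bits(word, 13, 9), None)


def encode_reg_reg(opcode: int, rd: int, rs1: int, rs2: int) -> int:
    """Build a register-register word under the same assumed layout."""
    return (opcode << 25) | (rd << 20) | (rs1 << 15) | (rs2 << 9)


word = encode_reg_reg(opcode=0x12, rd=3, rs1=4, rs2=5)  # hypothetical opcode
print(decode_reg_reg(word))
```

The Store and Compare-Branch formats reuse the same field positions, so a full decoder would differ mainly in how it interprets the Rd slot (High Imm or Cond) and the low nine bits.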

Load and Return type instructions are special cases of Register-Register in which (Rs1 +
Rs2) or (Rs1 + Immediate) is used as the effective address. The Rd field specifies the register
to be loaded for Load type instructions and is not used for Return type instructions. In order
to modify any special register, its contents must first be read into a general purpose register and
then written back to the special register after the modification.

[Field layout, left to right: 2 bits; 6 bits (type tag); 32 bits (address or data: Fixnum or
Character)]

Figure 2-1-2 SPUR Pointer

A SPUR pointer is a 40-bit word composed of a 32-bit address, a 6-bit type tag, and a 2-bit
generation number. These three parts are logically independent. The 6-bit type tag allows up to 64
possible types. The SPUR CPU hardware only recognizes two data types: Fixnum, which is an
integer that fits in a 32-bit word, and Character. In SPUR, Fixnum and Character are not
referenced indirectly through pointers but are represented as immediate data in the 32-bit
"address" field. Besides these two immediate data types, the SPUR CPU hardware also
recognizes two pointer types: Cons and Nil.

2.1.3.  LISP  Support 

The SPUR CPU supports LISP by three types of tag checking [Tay86] [ZHH88]: data type
checking for general operations, pointer type checking for list operations, and generation
checking for garbage collection.

Data Type Checking. In a runtime typing system such as LISP, the type of a variable is
not known at compile time and can change during the course of execution. Therefore, every
object’s type must be stored within the object itself or in all the pointers that point to it. SPUR
stores the type in the pointer because the type information is then available before the memory
reference. Figure 2-1-2 shows the SPUR pointer, which can be stored in a 40-bit CPU general
purpose register. The SPUR CPU instructions can be divided into two groups with respect to
data-type checking:

(1)  Data-type  checking  is  not  performed  and  the  tag  field  is  ignored.  LOAD  is  an  example. 

(2)  Data-type  checking  is  done  in  parallel  with  the  data  operation  and  traps  conditionally.  For 
example,  ADD  will  trap  if  either  operand  is  not  a  Fixnum. 
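
The second group can be sketched in Python as follows. The numeric tag encodings, the placement of the generation bits above the type tag, and the names `make_word`, `add`, and `TagTrap` are all hypothetical. Software has to perform the check before the addition, whereas the hardware performs it in parallel, and overflow handling is not modeled.

```python
FIXNUM, CHARACTER, CONS, NIL = 0, 1, 2, 3  # hypothetical tag encodings


class TagTrap(Exception):
    """Raised where the hardware would trap."""


def make_word(type_tag: int, data: int, generation: int = 0) -> int:
    """Pack a 40-bit word: generation | type tag | 32-bit data (assumed order)."""
    return ((generation & 0x3) << 38) | ((type_tag & 0x3F) << 32) | (data & 0xFFFFFFFF)


def type_of(word: int) -> int:
    return (word >> 32) & 0x3F


def data_of(word: int) -> int:
    return word & 0xFFFFFFFF


def add(a: int, b: int) -> int:
    """ADD with data-type checking: traps unless both operands are Fixnums."""
    if type_of(a) != FIXNUM or type_of(b) != FIXNUM:
        raise TagTrap("ADD operand is not a Fixnum")
    return make_word(FIXNUM, data_of(a) + data_of(b))


total = add(make_word(FIXNUM, 33), make_word(FIXNUM, 12))
print(type_of(total), data_of(total))
```

Passing a word tagged CONS to `add` raises `TagTrap`, which stands in for the conditional trap described in item (2).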
SPUR CONS Cell (each word: 2 bits, 6-bit type, 32 bits):

CAR: type | address or data (Fixnum or Character)
CDR: type | address (Cons or Nil) or data

Example:

SPUR List Representation: (33 12 (18 5))

[Diagram of the CONS cells for this list]


Figure 2-1-3 Pointer-Type Checking

Fixnums 33, 12, 18, and 5 are stored as immediate data. Assume the shaded CONS cell is
already in the CPU registers R5 and R6 and all other CONS cells are still in memory. More
specifically, assume the dotted CONS cell is in memory locations n and n + 1, and let m be the
address held in R6. Then the operations are:

CAR R6 <=> CXR (R6) => Load at location m (data Fixnum 12);

CDR R6 <=> CXR (R6+1) => Load at location m+1 (pointer Cons n);

CAR R5 <=> CXR (R5) => Trap because R5 is a Fixnum.

In order to simplify the figure, this example does not show the generation portion of the tag.


Pointer Type Checking. Figure 2-1-3 shows how SPUR represents a LISP list element by
a pair of consecutive storage elements called a CONS cell: one holds the CAR pointer and
the other holds the CDR pointer. Since the SPUR CPU is a load-store machine, CAR and
CDR operations are similar to the load operation. However, CAR and CDR operations are
defined in Common LISP to work only for a Cons pointer (points to another CONS cell) or a Nil
pointer (points to nothing). The SPUR CPU supports this feature by a special load instruction: the
CXR instruction performs the load in parallel with the pointer-type checking for Cons or Nil.
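
A minimal Python sketch of this checked load, using the example from Figure 2-1-3, is given below. The tag encodings, the tuple representation of a tagged word, and the addresses chosen for m and n are hypothetical, and the generation bits are omitted. What a load through Nil returns depends on how Nil is laid out in memory, which this sketch leaves out.

```python
FIXNUM, CONS, NIL = 0, 2, 3  # hypothetical type-tag encodings


class TagTrap(Exception):
    """Raised where the hardware would trap."""


def word(type_tag, data):
    """A tagged word as (type tag, 32-bit address or data)."""
    return (type_tag, data)


def cxr(memory, reg, offset):
    """Load memory[address + offset], trapping unless reg is Cons or Nil."""
    type_tag, address = reg
    if type_tag not in (CONS, NIL):
        raise TagTrap("CXR operand is not a Cons or Nil pointer")
    return memory[address + offset]


def car(memory, reg):
    return cxr(memory, reg, 0)


def cdr(memory, reg):
    return cxr(memory, reg, 1)


m, n = 100, 200  # arbitrary word addresses for the example
memory = {m: word(FIXNUM, 12), m + 1: word(CONS, n)}
r5, r6 = word(FIXNUM, 33), word(CONS, m)

print(car(memory, r6))  # the Fixnum 12 stored at location m
print(cdr(memory, r6))  # the Cons pointer to n stored at location m + 1
```

Calling `car(memory, r5)` raises `TagTrap`, matching the third case in the figure, where R5 holds a Fixnum.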

Generation Checking. SPUR uses the generation scavenging garbage collection algorithm
[Ung84] [Lill83]. This is based on the observation that the longer an object has been in use, the
more likely it is to continue to be in use. In SPUR, objects are separated dynamically into three
generations. Each object’s generation is recorded in its two generation bits. For simplicity,
Figure 2-1-4 only shows two generations. The SPUR CPU has a special store instruction which
compares the generation number of the operands in parallel with the store and traps to a routine
that updates the "Remembered List" whenever necessary.


Figure 2-1-4 Generation Checking

In this simplified drawing, objects are separated into two generations. New objects are allocated
in the younger generation and move to the older generation if they survive a garbage collection.
The garbage collector restricts its attention to the younger generations as much as possible. In
order to identify new objects that are currently in use, a "Remembered List" is used to keep track
of all the older objects that contain pointers to the younger generation.
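
The role of the generation-checking store can be sketched in Python as follows. This is a software model under stated assumptions, not the SPUR trap routine: the `Obj` class, the `store` function, and the convention that a larger generation number means an older object are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    generation: int                      # assumption: larger number = older
    slots: dict = field(default_factory=dict)


remembered = []  # older objects that contain pointers to a younger generation


def store(container: Obj, slot: str, value) -> None:
    """Store value into container and record old-to-young pointers."""
    container.slots[slot] = value
    if isinstance(value, Obj) and value.generation < container.generation:
        # This is where the hardware would trap to the update routine.
        if all(entry is not container for entry in remembered):
            remembered.append(container)


old, young = Obj(generation=1), Obj(generation=0)
store(old, "car", young)   # old-to-young pointer: old must be remembered
store(young, "cdr", old)   # young-to-old pointer: nothing to record
print(len(remembered))
```

Only the first store adds an entry, so the collector can find every young object reachable from the older generation by scanning the list rather than the whole older generation.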

2.1.4. The Basic Ingredients: Blocks, Clocking, and Pipeline

Block Diagram. Figure 2-1-5 can be considered as an abstract floor plan which shows the
relative position of each block within the SPUR CPU. The dimensions of each block, however, are
not drawn to scale.

Clocking. The SPUR CPU uses a four-phase non-overlapping clock; that is, each cycle consists
of four phases. Each phase has a nominal duration of 18 ns and there is a nominal non-overlap
time of 7 ns between phases. This makes the SPUR CPU cycle time 100 ns.
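
The cycle time follows directly from these nominal figures, as the short Python check below shows. It assumes one non-overlap gap after each of the four phases, which is what makes the total come to 100 ns. The constant names are illustrative.

```python
PHASES_PER_CYCLE = 4
PHASE_NS = 18        # nominal duration of each phase
NON_OVERLAP_NS = 7   # nominal gap following each phase

cycle_ns = PHASES_PER_CYCLE * (PHASE_NS + NON_OVERLAP_NS)
clock_mhz = 1000 / cycle_ns

print(cycle_ns, clock_mhz)  # prints "100 10.0"
```

A 100 ns cycle corresponds to a 10 MHz clock rate, and 28 ns of every cycle (4 x 7 ns) is spent in non-overlap time.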

[Block diagram. Labeled blocks: Instruction Unit, Execution Unit, Special Registers, Upper
Datapath, Program Counters, Control Unit, Lower Datapath, Operand Supply Unit, Trap Logic,
Functional Unit, Cache Controller Interface. Labeled buses: busPC, busI.]

Figure 2-1-5 The SPUR CPU Abstract Block Diagram

The SPUR CPU can be divided into two units: the Instruction Unit and the Execution Unit. The
Execution Unit can be further divided into four parts: the Cache Controller Interface, which is
out of the scope of this chapter, the Lower Datapath that handles all the data manipulating
operations, the Upper Datapath that handles all the program control operations, and the Control
Unit that controls the Upper Datapath and Lower Datapath.


Pipeline. The SPUR CPU uses a uniform four-stage pipeline (Figure 2-1-6). Each pipe
stage corresponds to one clock cycle and all instructions take four cycles to finish. The four
stages are:

Ifet  Instruction fetch. The instruction is delivered to the Execution Unit.

Exec  Register read and instruction execution for register-to-register instructions or register
      read and effective-address calculation for memory access instructions.

Mem   Memory access for all memory access instructions.

Wr    Register write. Write results back to the register file.

Unlike the RISC II 3-stage pipeline, the extra Mem stage in the SPUR CPU pipeline eliminates
the need to stall the pipeline whenever a load instruction is executed. This was considered to be
important at the beginning of the project because the frequency of Loads was expected to be
higher in LISP than in C. Similar to RISC II, a branch conflict in the SPUR CPU pipeline is
resolved by a delayed branch with one instruction in the delay slot. A delayed branch with the
option to cancel the instruction in the delay slot was considered but not implemented due to the
complexities involving interactions of internal forwarding, pipeline suspension, and the
coprocessor interface.


Figure 2-1-6 The SPUR CPU Pipeline

The Register-to-Register instruction (I2) always finishes its execution at the end of its Exec
stage. However, to avoid register write conflicts with possible previous Load instructions (I0,
I1), I2’s result is not written into the register file until its Wr stage. Because of this delay,
two temporary registers, Dst1 and Dst2, are needed to store the result at the end of its Exec and
Mem stages respectively. Instructions I3 and I4 will not be able to read I2’s result from the
register file, but the result can be read directly from Dst1 and Dst2. This is referred to as
internal forwarding.

As illustrated in Figure 2-1-6, data conflicts in the SPUR CPU pipeline are resolved by
internal forwarding, in which operands are supplied by temporary registers Dst1 or Dst2. For the
Load instruction (I0 in Figure 2-1-6), the value to be loaded comes from the external data bus at
the end of the Mem stage and goes directly into temporary register Dst2. Internal forwarding via
Dst1 is therefore impossible for the instruction immediately following the Load. In order to
simplify the internal forwarding logic, the destination register of the Load instruction is
defined to have an unknown value for the instruction immediately following the Load.
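
The operand selection implied by internal forwarding can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the actual IF_LOGIC: the names
TempReg and read_operand are hypothetical, the newest result (Dst1) is assumed to take priority
over Dst2, and a value of None is used to model the "unknown" value seen by the instruction
immediately following a Load.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TempReg:
    """Hypothetical model of a temporary register (Dst1 or Dst2)."""
    rd: Optional[int] = None      # destination register number, None if empty
    value: Optional[int] = None   # None models the "unknown" value after a Load


def read_operand(rs: int, reg_file: List[int], dst1: TempReg, dst2: TempReg):
    """Return the operand for source register rs, preferring the newest result."""
    if rs == 0:
        return 0                  # R0 is hardwired to zero
    if dst1.rd == rs:
        return dst1.value         # result of the instruction one stage ahead
    if dst2.rd == rs:
        return dst2.value         # result of the instruction two stages ahead
    return reg_file[rs]           # no conflict: read the register file


reg_file = [0] * 32
reg_file[5] = 11
# I3 reads r5 while I2's result (99) is still only in Dst1.
print(read_operand(5, reg_file, TempReg(5, 99), TempReg()))
# The instruction right after a Load of r5 sees an unknown value.
print(read_operand(5, reg_file, TempReg(5, None), TempReg()))
```

In this model the Load case simply leaves the value field of Dst1 empty, which mirrors the
rule that no forwarding path exists until the loaded data reaches Dst2.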


Chapter  2:  The  SPUR  CPU  Microarchitecture 




2.2.  Instruction  Unit 

The Instruction Unit provides the SPUR CPU with a 512-byte direct-mapped on-chip instruction
cache. The Instruction Unit's organization, control, and operation are discussed in Section 2.2.1,
Section 2.2.2, and Section 2.2.3, respectively. The implementation of the Instruction Unit is
described in Rich Duncombe's Master of Science report [Dun86].

2.2.1.  Instruction  Unit  Organization 


Figure  2-2-1  Instruction  Unit  Block  Diagram 

The 512-byte direct-mapped instruction cache is organized into sixteen blocks with eight
instructions per block. The Execution Unit requests an instruction by placing an address onto
busPC and the Instruction Unit delivers the instruction via busI. Since all SPUR instructions are
4 bytes long, only the upper 30 bits of busPC are used to access this cache. The Instruction Unit
is controlled by two finite state machines: the Fetch Finite State Machine (FET_FSM) and the
Prefetch Finite State Machine (PF_FSM). Besides acting as an internal instruction cache, the
Instruction Unit also provides internal instructions to simplify the control of the Execution
Unit.

The Instruction Unit, as shown in Figure 2-2-1, contains a 512-byte direct-mapped on-chip
instruction cache [HU87]. This cache differs from a regular cache in that, during a cache miss,
only the missing instruction rather than the full eight-instruction cache block is brought
immediately into the cache from the next higher level of memory. After the Execution Unit has
received this missing instruction and has resumed its normal operation, the Instruction Unit will
try to prefetch the rest of the block into the cache, one instruction per cycle, starting at the
instruction immediately after the missing instruction. In other words, each eight-instruction
block in Figure 2-2-1 is divided into eight one-instruction sub-blocks. On a cache miss, the
requested sub-block is brought into the cache immediately and a prefetching process is triggered
to bring the rest of the sub-blocks into the cache. Under ideal conditions, prefetching is fast
enough that after the first miss in a block, there will not be any more cache misses within that
block as long as the Execution Unit is executing sequential code.

Prefetching has the lowest priority among all cache accesses because the instruction being
prefetched may not be needed by the Execution Unit at all. Therefore, prefetching is interrupted
whenever the Execution Unit performs a data access (load or store) or whenever there is a new
miss in the Instruction Unit. Furthermore, if the instruction being prefetched is not in the
external cache, it will not cause an external cache miss. The prefetcher simply moves on to the
next instruction. Since prefetching is not always successful, each instruction must have a "Word
Valid Bit" (WV) to indicate its validity.
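
The sub-block organization and the Word Valid bits can be made concrete with a small behavioral
model. The sketch below is an assumption-laden illustration rather than the SPUR hardware: the
class and function names are hypothetical, the tag/index/offset split is the conventional one
for a direct-mapped cache, the external cache is modeled as a dictionary keyed by word address,
and prefetching is assumed to run to completion without interruption.

```python
BLOCKS = 16           # sixteen blocks
WORDS_PER_BLOCK = 8   # eight one-instruction sub-blocks per block


class SubBlockICache:
    """Hypothetical model of a direct-mapped cache with per-word valid bits."""

    def __init__(self):
        self.tags = [None] * BLOCKS
        self.valid = [[False] * WORDS_PER_BLOCK for _ in range(BLOCKS)]
        self.data = [[None] * WORDS_PER_BLOCK for _ in range(BLOCKS)]

    @staticmethod
    def split(word_addr):
        """Split a word address into (tag, block index, word offset)."""
        offset = word_addr % WORDS_PER_BLOCK
        index = (word_addr // WORDS_PER_BLOCK) % BLOCKS
        tag = word_addr // (WORDS_PER_BLOCK * BLOCKS)
        return tag, index, offset

    def hit(self, word_addr):
        tag, index, offset = self.split(word_addr)
        return self.tags[index] == tag and self.valid[index][offset]

    def fill(self, word_addr, instr):
        """Write one instruction; a tag change clears the block's WV bits."""
        tag, index, offset = self.split(word_addr)
        if self.tags[index] != tag:
            self.tags[index] = tag
            self.valid[index] = [False] * WORDS_PER_BLOCK
        self.data[index][offset] = instr
        self.valid[index][offset] = True


def prefetch_order(miss_addr):
    """Word addresses prefetched after a miss, wrapping inside the block."""
    base = miss_addr - miss_addr % WORDS_PER_BLOCK
    start = miss_addr % WORDS_PER_BLOCK
    return [base + (start + i) % WORDS_PER_BLOCK
            for i in range(1, WORDS_PER_BLOCK)]


def service_miss(cache, external, miss_addr):
    """Fetch the missing word, then prefetch the rest of its block."""
    cache.fill(miss_addr, external[miss_addr])
    for addr in prefetch_order(miss_addr):
        if addr in external:          # present in the external cache
            cache.fill(addr, external[addr])
        # otherwise skip it: a prefetch never causes an external cache miss


cache = SubBlockICache()
external = {a: "instr%o" % a for a in range(0o10, 0o20) if a != 0o12}
service_miss(cache, external, 0o10)
print(cache.hit(0o11), cache.hit(0o12))   # word 12 (octal) was skipped
```

A word that the prefetcher skipped stays invalid, so a later request for it is simply treated
as another miss.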

The Instruction Unit also plays a role in simplifying the control of the Execution Unit by
providing internal instructions to "fool" the Execution Unit pipeline. For example, the internal
instructions trap_call and rd_pc are used to simplify trap handling. This is explained in
Section 2.3.3. The miss internal instruction is used whenever the Execution Unit requests an
instruction that is not currently in the Instruction Unit. This case is discussed in
Section 2.2.3.

Figure 2-2-2 Simplified State Diagrams

The diagram on the left is the state diagram for the Fetch Finite State Machine and the one on
the right is for the Prefetch Finite State Machine. In the Fetch Finite State Machine, the input
signal startFetch is a composite signal:

startFetch = notSuspend and notWrKpsw and (ibMiss or flush)

Outputs of these machines are not shown here for simplicity. All but one of the outputs are used
to control the datapath of the Instruction Unit. The exception is startPF, which is an output of
the Fetch Finite State Machine and triggers the Prefetch Finite State Machine:

startPF = MEM_BUSY or (NORMAL and notSuspend and (ibMiss or flush))
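
The two composite signals above can be transcribed directly into Python for experimentation.
This is a minimal sketch: the function names are hypothetical, and it assumes that the terms
MEM_BUSY and NORMAL in the startPF equation mean "the Fetch Finite State Machine is currently
in that state".

```python
def start_fetch(not_suspend, not_wr_kpsw, ib_miss, flush):
    """startFetch = notSuspend and notWrKpsw and (ibMiss or flush)."""
    return not_suspend and not_wr_kpsw and (ib_miss or flush)


def start_pf(fetch_state, not_suspend, ib_miss, flush):
    """startPF = MEM_BUSY or (NORMAL and notSuspend and (ibMiss or flush))."""
    in_mem_busy = fetch_state == "MEM_BUSY"   # assumed reading of MEM_BUSY
    in_normal = fetch_state == "NORMAL"       # assumed reading of NORMAL
    return in_mem_busy or (in_normal and not_suspend and (ib_miss or flush))


# A miss in the NORMAL state with the pipeline running raises both signals.
print(start_fetch(True, True, True, False))
print(start_pf("NORMAL", True, True, False))
```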


2.2.2.  Instruction  Unit  Control 

Initially, we envisioned a single finite state machine controlling the entire Instruction Unit.
This turned out to be a difficult design task, and the result was so hard to understand that we
had little confidence in its correctness. Further investigation revealed that prefetching should
occur in parallel with other Instruction Unit operations and is quite autonomous. This gave us
the idea of delegating the control to two independent finite state machines: one for prefetching,
the Prefetch Finite State Machine, and one for the rest of the operations. For lack of a better
name, the latter is called the Fetch Finite State Machine. The simplified state diagrams of these
two finite state machines are shown in Figure 2-2-2.

[Figure labels: inputs valid during φ1, φ2, or φ3, respectively; outputs valid during φ1, φ2,
φ3, or φ4, respectively.]

Figure 2-2-3 Generic Structure of the I-Unit Finite State Machines

The state of the finite state machine is determined by the contents of the Present State Register
(shaded in this figure). Since the output of the Present State Register is updated once every
cycle and is stable by φ1, a state begins in φ1. The state information state_c1 must be used to
generate outputs that are valid during φ4 to prevent race conditions. Similarly, state_c4 must be
used to generate outputs that are valid during φ1. For outputs that are valid during φ2 and φ3,
either state_c1 or state_c4 can be used. The output of this finite state machine is a function of
the Present State, and any inputs that are valid during phase N (N=1, 2, or 3) can affect outputs
that are valid during phase N+1 of the same cycle. For example, outputs that are valid during φ4
can be a function of inputs that are valid during φ1, φ2, and φ3.

[Figure labels: inputs valid during φ1, φ2, or φ3, respectively; outputs valid during φ1, φ2,
φ3, or φ4, respectively.]

Figure 2-2-4 Classical Implementation of Finite State Machine

In this classical implementation, the State & Output Logic evaluates the next state and the
output together, and the results are latched into the Output & Present State Register at the end
of φ4. The outputs can therefore only be affected by the previous state and inputs from the
previous cycle. In contrast, the outputs of the organization in Figure 2-2-3 are functions of the
present state and inputs from any previous phase of the current cycle. Notice that a latch must
be placed between the Output register and the output signals that are valid during φ4 to prevent
race conditions.

Both finite state machines have only a small number of states and are implemented by the
generic structure shown in Figure 2-2-3. The State Logic and Output Logic blocks are implemented
by PLAs. There are two reasons why they are not combined into one single block (Figure 2-2-4) as
suggested by most "classical" VLSI textbooks. First, separating them makes the designer's job
easier. More importantly, in this arrangement the outputs depend on the Present State, and any
inputs that are valid during phase N (N=1, 2, or 3) can affect outputs that are valid during
phase N+1 of the same cycle. On the other hand, if we combine the Output Logic and State Logic as
tradition dictates (Figure 2-2-4), the output signals can only depend on the previous state and
the inputs from the previous cycle, rather than the current state and the inputs from the
previous clock phase. This would increase the latency of the Instruction Unit.
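
The latency difference between the two organizations can be seen in a cycle-level toy model.
The sketch below is illustrative only: the two-state machine, its single "miss" input, and all
function names are hypothetical, and phases within a cycle are collapsed so that "same cycle"
stands for "a later phase of the same cycle".

```python
def next_state(state, miss):
    """Toy state logic: leave IDLE on a miss, then return to IDLE."""
    return "BUSY" if (state == "IDLE" and miss) else "IDLE"


def output(state, miss):
    """Toy output logic: request a fetch when a miss is seen in IDLE."""
    return state == "IDLE" and miss


def run_separate(inputs):
    """Separate State Logic and Output Logic, as in Figure 2-2-3."""
    state, outs = "IDLE", []
    for miss in inputs:
        outs.append(output(state, miss))   # present state, this cycle's input
        state = next_state(state, miss)
    return outs


def run_classical(inputs):
    """Combined logic with a registered output, as in Figure 2-2-4."""
    state, out_reg, outs = "IDLE", False, []
    for miss in inputs:
        outs.append(out_reg)               # latched at the end of the last cycle
        out_reg = output(state, miss)
        state = next_state(state, miss)
    return outs


inputs = [False, True, False, False]
print(run_separate(inputs))    # the request appears in the cycle of the miss
print(run_classical(inputs))   # the same request appears one cycle later
```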

Since we did not implement the single finite state machine version of the control, it is hard
to judge the advantages of separating the control into two finite state machines in terms of
implementation metrics such as area, power consumption, and number of logic gates. However, this
separation greatly simplified the design and verification effort by allowing us to focus our
attention on one thing at a time. This illustrates an important point in VLSI design: logic
optimization is important as long as you are still trying to meet implementation constraints.
Once these constraints are met, continuing optimization can be counterproductive, not only
because the design time could be spent on something else, but also because it may make the design
harder to understand and thus harder to verify and modify.

[Pipeline diagram: instructions labeled by octal word address (for example, word address 07 and
I3 at word address 11) and the internal miss instructions, each shown passing through the Ifet
and Exec stages.]

Figure 2-2-5 Fetch and Prefetch: Execution Unit's Perspective

Assume instructions I0, I1, I2, and I3 are at consecutive word addresses (octal) 06, 07, 10, and
11, and only I0 is currently (Cycle T0) in the Instruction Unit cache array. Under ideal
conditions (most of the time), the Instruction Unit will be able to prefetch I1 in time. However,
when the Execution Unit requests I2, the block boundary is crossed and a miss occurs. Instead of
suspending everything in the Execution Unit's pipeline, the Instruction Unit inserts the internal
instruction miss into the pipeline so that both I0 and I1 can proceed while I2 is being fetched.
For the Instruction Unit's perspective, please refer to Table 2-2-1.

2.2.3.  Instruction  Unit  Operation 

Figure 2-2-5 shows the Execution Unit's view of how a miss in the Instruction Unit cache array
is handled under ideal conditions. After the Execution Unit is fooled by the internal miss
instructions during cycles T2 and T3, the Instruction Unit tries to fetch I2 and prefetch
instructions in the same block as I2 from the external cache during cycles T4 and T5. This is
further illustrated in Table 2-2-1. Notice that in Figure 2-2-5, during cycles T2 and T3,
instructions I0 and I1 are allowed to proceed. This is necessary to prevent deadlock, since I0 or
I1 could be a Load or Store type instruction that would start accessing the external cache at the
end of its Exec stage. If such an already-started cache access were not allowed to finish, the
I-Unit could not start the fetch for I2, because the external cache in SPUR is not separated into
data and instruction caches. A deadlock would occur because the Execution Unit would wait for I2
while the Instruction Unit could not fetch I2 until the cache is free.

Cycle          Fetch Finite State Machine            Prefetch Finite State Machine
(Fig. 2-2-5)   State & Actions                       State & Actions
------------   -----------------------------------   ----------------------------------------
T0             NORMAL:                               PREFETCH:
               Deliver I0 (06)                       Prefetch (00)
                                                     Receive and write I1 (07) into cache

T1             NORMAL:                               PREFETCH:
               Deliver I1 (07)                       Prefetch (01)
                                                     Instruction (00) is not written (Note 1)

T2             NORMAL:                               PREFETCH:
               Deliver "miss" opcode                 Prefetch is blocked by Fetch
               Fetch I2 (10) from external cache

T3             MEMPEND: (Note 2)                     WAITING:
               Deliver "miss" opcode                 As soon as I2 is received,
               Receive and write I2 (10) (Note 3)    prefetch I3 (11)

T4             INSVALID:                             PREFETCH:
               Deliver I2 (10) (Note 4)              Prefetch (12)
                                                     Receive and write I3 (11)

T5             NORMAL:                               PREFETCH:
               Deliver I3 (11)                       Prefetch (13)
                                                     Receive and write (12)

Table 2-2-1 Fetch and Prefetch: Instruction Unit's Perspective

Notice that prefetching does not cross a block boundary. This is illustrated by Cycle T0, in
which the prefetcher has already wrapped around and started prefetching the instruction at word
address 00 instead of the instruction at word address 10 (octal).

Notes:

1. Assume instruction (00) is already in the cache, so there is no need to write it.

2. Assume the external cache is not busy. Otherwise, the Fetch machine goes to MEM_BUSY and
waits.

3. Assume the external cache can deliver the instruction in one cycle. Otherwise, the Fetch
machine stays in MEMPEND until it receives I2.

4. I2 is not received until phi3 of Cycle T3. Therefore it cannot be delivered to the Execution
Unit until Cycle T4.

Figure 2-2-5 and Table 2-2-1 show the ideal case in which the cache miss is handled in two
cycles and the prefetch of I3 is successful. In practice, either I0 or I1 can block the cache
access path, in which case the Fetch machine must first go into the MEM_BUSY state during T3 and
wait for the cache to be free before starting the fetch of I2 and entering the MEMPEND state.
Once in the MEMPEND state, the external cache may not be able to deliver I2 within one cycle, and
the Fetch machine must stay there until I2 is received. In other words, the Instruction Unit may
take more than two cycles to recover from the miss, and more than two miss instructions must then
be inserted.
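
The miss-recovery sequence of the Fetch machine can be sketched as a small state machine. The
state names come from Table 2-2-1, but the transition function below is inferred from the text
rather than taken from the actual PLA, and the helper names are hypothetical. The sketch also
assumes that a "miss" opcode is delivered in every cycle before INSVALID is reached.

```python
def fetch_fsm_step(state, ib_miss, cache_busy, instr_received):
    """Next state of the Fetch machine (transitions inferred from the text)."""
    if state == "NORMAL":
        if not ib_miss:
            return "NORMAL"
        return "MEM_BUSY" if cache_busy else "MEMPEND"
    if state == "MEM_BUSY":
        return "MEM_BUSY" if cache_busy else "MEMPEND"
    if state == "MEMPEND":
        return "INSVALID" if instr_received else "MEMPEND"
    return "NORMAL"   # INSVALID: the instruction is delivered, back to NORMAL


def miss_trace(busy_cycles=0, delivery_cycles=1):
    """States visited for one miss, from detection until INSVALID.

    busy_cycles:     cycles spent waiting in MEM_BUSY for the external cache
    delivery_cycles: cycles spent in MEMPEND before the instruction arrives
    """
    trace, state = [], "NORMAL"
    busy_left, pend_left = busy_cycles, delivery_cycles
    while state != "INSVALID":
        trace.append(state)
        if state == "MEM_BUSY":
            busy_left -= 1
        if state == "MEMPEND":
            pend_left -= 1
        state = fetch_fsm_step(state, True, busy_left > 0, pend_left == 0)
    trace.append(state)
    return trace


ideal = miss_trace()
print(ideal, "miss opcodes:", len(ideal) - 1)
slow = miss_trace(busy_cycles=2, delivery_cycles=1)
print(slow, "miss opcodes:", len(slow) - 1)
```

In the ideal case the trace matches cycles T2 through T4 of Table 2-2-1, with two miss opcodes
inserted; any extra MEM_BUSY or MEMPEND cycle adds one more.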

Prefetching can be turned off by setting a bit in the Kernel Processor Status Word. In that
case, the Prefetch Finite State Machine will remain in the IDLE state and will not prefetch any
instructions. Furthermore, the whole Instruction Unit can be disabled so that none of the
instructions are cached. If the Instruction Unit is disabled, the Prefetch Finite State Machine
will remain in the IDLE state and the Fetch Finite State Machine will continuously cycle between
the NORMAL, MEMPEND (or MEM_BUSY), and INSVALID states. In other words, a disabled Instruction
Unit is just a special case in which every access to the Instruction Unit results in a miss.

2.3. Execution Unit

The Execution Unit executes the instructions delivered by the Instruction Unit. The Execution
Unit's datapath organization is discussed in Section 2.3.1. The Execution Unit's operation under
normal and adverse conditions is discussed in Section 2.3.2 and Section 2.3.3, respectively. The
implementation of the Execution Unit's datapath is described by Dave Lee [Lee86].

2.3.1. Execution Unit Datapath

The Execution Unit datapath can be divided into two parts: the Lower Datapath and the Upper
Datapath. The Lower Datapath performs all the register-to-register operations and can be
further divided into an Operand Supplier and a Functional Unit. Here is a brief description of
each block in the Lower Datapath, from left to right as shown in Figure 2-3-1 (please refer to
the naming conventions described at the beginning of Chapter 2, p. 14):




Figure 2-3-1 The SPUR CPU Lower Datapath

The SPUR CPU Lower Datapath is 40 bits wide; the upper eight bits (39-32) are for tags. BUSBUFA
& B serve as operand buffers. Everything to their left can be considered the Operand Supplier
and everything to their right the Functional Unit. BusA, busB, busA2, and busB2 route operands
from the Operand Supplier to the Functional Unit. The result of the computation is routed back
to the Operand Supplier via busD. BusL connects the Lower Datapath to the data pads, and busS
connects the Lower Datapath to the Upper Datapath and the Memory Address Latches (Mals).


REGISTER FILE is a 138-word, 40-bit register file with two read ports and a single write port.
It is organized into ten global registers and eight overlapping register windows (see
Appendix A). Register R0 is hardwired to zero.

Dst2<39:0> is the second temporary register for the 4-stage pipeline. The result of every
instruction that requires writing to the register file is saved here at the end of the Mem stage.

Dst1<39:0> is the first temporary register for the 4-stage pipeline. The result of every
instruction that requires writing to the register file (except Load) is saved here at the end of
the Exec stage.

IF_LOGIC detects data hazards and instructs Dst1 or Dst2 or both to override the REGISTER FILE
and supply the operand or operands.

Mbr<39:0> is the memory buffer register. It stores the data to be written to external memory.

MUXs route immediate constants into the datapath as operands.

BUSBUFA & B<39:0> latch in busA and busB, respectively, during φ1 and drive busA2 and busB2
during φ2. They serve as buffers between the Operand Supplier (items listed above) and the
Functional Unit (items listed below).

EXT_INS<39:0> is the byte extractor and inserter. The SPUR CPU uses this together with the
SHIFTER to replace the more traditional 32-bit barrel shifter.

SHIFTER<31:0> performs left shifts of up to 3 bits and 1-bit arithmetic and logical right
shifts.

ALU<31:0> performs the A + B, A - B, A XOR B, A AND B, and A OR B functions.

BRANCH_COND evaluates the branch conditions for all the Compare-Branch type instructions.

BUSSTOD<31:0> is a buffer between the Upper Datapath and the Lower Datapath. The Upper Datapath
deposits values into it via busS during φ2, and it is one of the potential busD drivers during
φ4.

Figure 2-3-2 The SPUR CPU Upper Datapath

The SPUR CPU Upper Datapath is 30 bits wide because its main function is to provide the
instruction address, whose two LSBs are always ignored by the word-addressed SPUR memory system.
BusI, which contains the instruction, provides the immediate offsets for all the Call, Jump, and
Compare-Branch type instructions. BusPC contains the instruction address to be sent to the
Instruction Unit. BusS, as explained before, connects the Lower Datapath, the Upper Datapath, and
the Memory Address Latches (Mals).

Stage/Phase     Actions
===========================================================================
Ifet Stage:
  Phase 3       busI <- I-Unit[busPC] ;
===========================================================================
Exec Stage:
  Phase 1       busA <- REG_FILE[Rs1], busB <- (not REG_FILE[Rs2]) ;
                BUSBUFA <- busA,
                if (busI<14> = 0) BUSBUFB <- (not busB)
                else BUSBUFB <- Sign Extend (busI<13:0>) ;
---------------------------------------------------------------------------
  Phase 2       busA2 <- BUSBUFA, busB2 <- BUSBUFB ;
                Port A of (ALU, SHIFTER, or EXT_INS) <- busA2,
                Port B of (ALU, SHIFTER, or EXT_INS) <- busB2 ;
---------------------------------------------------------------------------
  Phase 4       busD <- Output Port of (ALU, SHIFTER, or EXT_INS) ;
                Dst1 <- busD ;
===========================================================================
Mem Stage:
  Phase 1       busPC <- INC ;
                IfetPC <- busPC, I-Unit <- busPC ;
---------------------------------------------------------------------------
  Phase 3       Dst2 <- Dst1 ;
===========================================================================
Wr Stage:
  Phase 3       busA <- Dst1, busB <- (not Dst2) ;
                REG_FILE[rd] <- (busA & (not busB)) ;  (Note 1)
===========================================================================

Table 2-3-1 Register-Register Operation

Load, Return, Read Special, and Write Special type instructions are special cases of
Register-Register. Their operations are shown in Appendix A.

Note:

1. To write a register, the true and complement values are put onto busA and busB, respectively.

The Upper Datapath, where all the special registers reside, performs all the program-control
operations. Below is a brief description of each block in the Upper Datapath, from left to right
as shown in Figure 2-3-2:

Cwp<4:2> is the current window pointer, which points to the register window that is currently
in use.

Swp<31:3> is the saved window pointer, which points to the memory location where the last
overflowed register window (pointed to by Swp<9:7>) is saved.

TrapPC<31:2> holds the word address of the trap vector. Whenever a trap occurs, the hardware
loads the upper 30 bits of the byte address (hex) 000010T0 into TrapPC<31:2>, where T is a
function of the trap type.

CallPC<31:2> holds the target address for Call and Jump type instructions. This address is
formed by concatenating the 2 MSBs of ExecPC and a 28-bit value provided by the instruction.

ADDER<31:2> calculates the target address for all the Compare-Branch type instructions while
the ALU is doing the comparison. The 8-bit offset is first sign extended before being added to
the ExecPC.

BUSS2PC<31:2> receives values from the Lower Datapath via busS during φ4 and drives busI during
φ1 for Jump_Register or Return type instructions.

IfetPC<31:2> normally holds the address of the instruction currently in the Ifet stage. If the
instruction currently in the Ifet stage is not in the Instruction Unit, IfetPC must hold onto
the next instruction's address until the miss is serviced.

INC<31:2> is the incrementer. It evaluates the next instruction's address for sequential
operation.

ExecPC<31:2> and MemPC<31:2> hold the addresses of the instructions currently in the Exec and
Mem stages of the pipeline, respectively. This chain of PCs is needed for trap handling (see
Section 2.3.3).

FpuPC<31:2> holds the address of the last FPU (coprocessor) instruction that was sent to the
FPU. This is needed for parallel operation between the FPU and CPU [HaK86].

Upsw<31:2> is the user processor status word (see Appendix A).

Kpsw<31:2> is the kernel processor status word (see Appendix A).

Stage/Phase     Actions
===========================================================================
Ifet Stage:
  Phase 3       busI <- I-Unit[busPC] ;
===========================================================================
Exec Stage:
  Phase 1       busA <- REG_FILE[Rs1], busB <- (not REG_FILE[Rs2]) ;
                Mbr <- (not busB), BUSBUFA <- busA,
                BUSBUFB <- Sign Extend (busI<24:20> cat busI<8:0>) ;
---------------------------------------------------------------------------
  Phase 2       busA2 <- BUSBUFA, busB2 <- BUSBUFB ;
                Port A of ALU <- busA2, Port B of ALU <- busB2 ;
---------------------------------------------------------------------------
  Phase 4       busS <- ALU ; Address Pads <- busS
===========================================================================
Mem Stage:
  Phase 1       busPC <- INC, busL <- Mbr ;
                IfetPC <- busPC, I-Unit <- busPC,
                Data Pads <- busL ;
===========================================================================

Table 2-3-2 Store Operation

Stage/Phase     Actions
===========================================================================
Ifet Stage:
  Phase 3       busI <- I-Unit[busPC] ;
===========================================================================
Exec Stage:
  Phase 1       busA <- REG_FILE[Rs1], busB <- (not REG_FILE[Rs2]) ;
                BUSBUFA <- busA,
                if ((Cond = eq_tc) or (Cond = neq_tc))  (Note 1)
                    BUSBUFB<39:32> <- busI<14:9>
                else if (busI<14> = 0) BUSBUFB <- Zero Extend (busI<13:9>)
                else BUSBUFB <- busB ;
---------------------------------------------------------------------------
  Phase 2       busA2 <- BUSBUFA, busB2 <- BUSBUFB ;
                Port A of (ALU, BRANCH_COND) <- busA2,
                Port B of (ALU, BRANCH_COND) <- busB2 ;
===========================================================================
Mem Stage:
  Phase 1       if (BRANCH_COND = valid)
                    busPC <- ExecPC + Sign Extend (busI<8:0>)
                else busPC <- INC ;
                IfetPC <- busPC, I-Unit <- busPC ;
===========================================================================

Table 2-3-3 Compare-Branch Operation

Notes:

1. Cond is the conditional field of the instruction (Figure 2-1-1). eq_tc and neq_tc are two
possible branch conditions (see Appendix A).

To get a better idea of how each block in the datapaths is used during each of the four phases
at different pipeline stages, you must understand the operation of the Execution Unit. The
operation of the Execution Unit under normal conditions and adverse conditions is discussed in
Section 2.3.2 and Section 2.3.3, respectively.
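
As a concrete illustration of the Upper Datapath address paths, the formation of the Call and
Jump target held in CallPC can be sketched in Python. This is a minimal sketch under stated
assumptions: the function names are hypothetical, addresses are handled as 32-bit byte
addresses whose two LSBs are dropped to obtain the 30-bit word address, and the 28-bit field is
taken from the low 28 bits of the instruction word.

```python
MASK32 = 0xFFFFFFFF


def word_address(byte_addr):
    """Drop the two LSBs: the 30-bit word address carried by busPC<31:2>."""
    return (byte_addr & MASK32) >> 2


def call_target(exec_pc, instr):
    """CallPC: the 2 MSBs of ExecPC concatenated with a 28-bit instruction field.

    exec_pc is a 32-bit byte address; the result is also a byte address.
    """
    top2 = (exec_pc & MASK32) >> 30              # ExecPC<31:30>
    imm28 = instr & 0x0FFFFFFF                   # 28-bit value from the instruction
    word_addr = (top2 << 28) | imm28             # 30-bit word address
    return (word_addr << 2) & MASK32


target = call_target(0x80001000, 0x00000040)
print(hex(target), hex(word_address(target)))
```

Because only the two MSBs come from ExecPC, a Call or Jump can reach any word within the
current quarter of the address space.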

2.3.2.  Execution  Unit  Operation-Normal  Conditions 


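Stage/Phase     Actions
===========================================================================
Ifet Stage:
  Phase 3       busI <- I-Unit[busPC] ;
                CallPC <- ExecPC<31:30> cat busI<27:0> ;
===========================================================================
Exec Stage:
  Phase 2       if (opcode = CALL) Cwp <- Cwp + 1,
                busS <- ExecPC ;
                BUSSTOD <- busS ;
---------------------------------------------------------------------------
  Phase 4       busD <- BUSSTOD ;
                Dst1 <- busD ;
===========================================================================
Mem Stage:
  Phase 1       busPC <- CallPC ;
                IfetPC <- busPC, I-Unit <- busPC ;
---------------------------------------------------------------------------
  Phase 3       Dst2 <- Dst1 ;
===========================================================================
Wr Stage:
  Phase 3       if (opcode = CALL) {
                    Update Backup Copy of Cwp,
                    busA <- Dst1, busB <- (not Dst2) ;
                    REG_FILE[rd] <- (busA & (not busB)) } ;
===========================================================================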

Table  2-3-4  Call-Jump  Operation 




In Table 2-3-1 through Table 2-3-4, the four generic instruction types (Register-Register,
Store, Compare-Branch, and Call-Jump) are used to show the Execution Unit operation under
normal conditions. Similar tables for the rest of the instruction types are presented in
Appendix A. In these tables, different phases are separated by a single horizontal line and
different pipeline stages are separated by a double horizontal line. Within a phase, all
operations occur in parallel unless they are separated by a semicolon.

Figure 2-3-3 The SPUR CPU Timing

Each block represents a time interval and the number inside the parentheses is the duration of
that time interval in ns. The bit lines for both the register file and the instruction cache
array are precharged high before read and write. For the register file, they are precharged
during φ2 and φ4. For the instruction cache, they are precharged during φ1 and φ3. Notice that
almost all actions are triggered by the clock. There are no self-timed circuits in the SPUR CPU.

The general timing of the SPUR CPU is summarized in Figure 2-3-3. The SPUR CPU uses a
four-phase non-overlapping clock [JBH87]. The duration of each phase is 18ns and the non-overlap
time is 7ns. The critical paths within each phase must be shorter than the phase duration (18ns)
because all latches in the SPUR CPU latch in data on the falling clock edge. Figure 2-3-3 shows
that the critical paths for the register file, the functional unit, the instruction unit, and
the external cache all have at least a 4ns safety margin. This is probably why most of the CPUs
we received can run at an 80ns cycle time. Figure 2-3-3 also illustrates one big drawback of
multi-phase clocking: there is a lot of dead time (horizontal white space between the boxes).
Notice that not only do we waste time during the non-overlap time, we also waste the time at the
end of each phase due to the requirement for a safety margin.
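
The cost of this dead time can be estimated with simple arithmetic. The sketch below is a
back-of-the-envelope calculation, not a timing analysis: the function names are hypothetical,
it assumes that each of the four phases is followed by exactly one non-overlap interval, and it
assumes that the 4ns safety margin applies once per phase.

```python
def cycle_time_ns(phases, phase_ns, non_overlap_ns):
    """Cycle time when every phase is followed by one non-overlap interval."""
    return phases * (phase_ns + non_overlap_ns)


def dead_time_ns(phases, non_overlap_ns, margin_ns):
    """Lower bound on unused time per cycle: non-overlap gaps plus margins."""
    return phases * (non_overlap_ns + margin_ns)


cycle = cycle_time_ns(4, 18, 7)   # 4 * (18 + 7) = 100 ns
dead = dead_time_ns(4, 7, 4)      # 4 * (7 + 4) = 44 ns
print(cycle, dead, round(100.0 * dead / cycle))   # 100 44 44
```

Under these assumptions, at least 44ns of the 100ns cycle is spent in non-overlap intervals and
safety margins rather than in useful work.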

23.3.  Execution  Unit  Operation-Adverse  Conditions 

Trap  Request  Handling.  A  trap  request  is  caused  by  unusual  conditions  that  arise  at  run 
time.  Trap  request  handling  refers  to  the  handling  of  these  unusual  run  time  conditions.  The 
detection  of  these  conditions  wiU  be  discussed  in  next  section.  This  section  explains  how  the 
SPUR  CPU  handles  trap  requests.  A  trap  request  is  handled  in  three  steps;  (1)  branch  to  a  loca¬ 
tion  defined  by  the  trap  type,  (2)  open  a  new  register  window,  and  (3)  save  the  addresses  of  the 
instructions  that  are  affected.  As  illustrated  in  Figure  2-3-4,  all  these  can  be  accomplished  by  the 
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Figure  2-3-4  Pipeline  During  Trap 


In  this  example,  trap  request,  which  is  asserted  in  the  Mem  stage  of  77,  can  only  due  to  either  77 
or  some  external  asynchronous  unusual  condition  that  happens  to  kill  77.  Obviously,  77  s  address 
must  be  saved  but  I2's  address  must  also  be  saved  because  10  may  be  a  delay  branch.  Due  to  the 
assertion  of  trap  request,  instructions  13  and  14' s  positions  in  the  pipeline  are  replaced  by  two 
internally  generated  instructions:  trap_call,  and  read _pc.  Trap_call  branches  to  the  trap  location, 
ojiens  a  new  window,  and  saves  Il’s  address  in  RIO  of  the  new  window.  Read_PC  saves  72  s  ad¬ 
dress  in  R16  of  the  new  window. 
internal instruction sequence trap_call followed by rd_pc. Internal instructions are generated by
the Instruction Unit (see Figure 2-2-1). As far as the Execution Unit is concerned, rd_pc is the
same as "rd_special r16, ExecPC" and the only difference between trap_call and the regular call
is that TrapPC is used as the target address instead of CallPC (Figure 2-3-2). Douglas Johnson
has evaluated the effectiveness of the SPUR CPU trap architecture [Joh88].

Pipeline Suspension. In theory, the SPUR CPU pipeline can be suspended for an infinite
number of cycles, as illustrated in Figure 2-3-5. Notice that every instruction in the pipeline is
suspended, unlike the situation shown in Figure 2-2-5, in which only the issue of new instructions
is suspended. Therefore, we refer to the pipeline suspension in Figure 2-3-5 as Global Pipeline
Suspension and the situation shown earlier in Figure 2-2-5 as Partial Pipeline Suspension. Global
Pipeline Suspension is used to handle external cache miss and coprocessor busy conditions. Partial
Pipeline Suspension is used to handle internal instruction cache misses because, as explained in
Section 2.2.3, Global Pipeline Suspension cannot be used due to a potential deadlock condition.
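
The difference between the two suspension modes can be sketched with a toy model. This is only an illustration and not the SPUR control logic: the pipeline is reduced to a list of the three stages that follow issue (Exec, Mem, Wr), None stands for an empty slot, and the function name step and the mode strings are hypothetical.

```python
def step(stages, fetched, mode="run"):
    """Advance a toy [Exec, Mem, Wr] pipeline by one clock cycle.

    stages  -- instructions currently in Exec, Mem, and Wr (None = empty slot)
    fetched -- instruction waiting in Ifet to be issued
    mode    -- "run", "global" (Global Pipeline Suspension), or
               "partial" (Partial Pipeline Suspension)
    """
    if mode == "global":
        # Every instruction holds its stage; nothing moves.
        return list(stages)
    if mode == "partial":
        # Only issue is suspended: older instructions keep advancing
        # while an empty slot enters Exec.
        return [None] + list(stages[:-1])
    # Normal operation: the fetched instruction is issued into Exec.
    return [fetched] + list(stages[:-1])
```

In this model, repeating a "partial" step drains the instructions that were already issued, whereas repeating a "global" step leaves every stage unchanged for as many cycles as the suspension lasts.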
Figure 2-3-5 Global Pipeline Suspension


The SPUR CPU pipeline can be suspended for two reasons: coprocessor (FPU) busy, or cache
miss. The first reason is out of the scope of this chapter and is explained in [HaK86]. The second
reason can be explained using this figure by assuming I0 is a Load or Store type instruction. If
I0 causes an external cache miss, then the SPUR CPU pipeline will be suspended until the data is
valid.
2.4. The SPUR CPU Controller

The major design theme behind the SPUR CPU controller is decentralization. As described
in Section 2.2 above, the Instruction Unit and Execution Unit have their own controllers. Furthermore,
using the internal instructions miss, trap_call, and read_pc, the Instruction Unit simplifies the
control of the Execution Unit by reducing complex control functions such as instruction miss and
trap handling into simple instruction sequences that can be executed uniformly by the Execution
Unit's four-stage pipeline. Within the Execution Unit, the control responsibility is further
delegated to three independent parts:


Priority     Trap Type                Side Effects   Description

0            RESET (0)      1000      kpei           power-on initiation
1            ERROR (1)      1010      kpei           bus fault or hardware error
2            WIN_OV (2)     [illegible]              window overflow
             WIN_UN (3)                              window underflow
3            FAU_IN (4)     1040      k              page fault or interrupt
4            [illegible]    1050                     FPU exception
5            RUN_ER (6)     1060                     Run time software errors:
                                                       illegal opcode
                                                       kernel mode violation
                                                       LISP pointer type violation
6            TAG_TR (7)     1070                     Run time tag violation:
                                                       generation trap
                                                       LISP data type violation
7 (lowest)   IN_OV (8)      1080                     Integer overflow
             CMP_TR (9)     1090      k              cmp_trap instruction

Table 2-4-1 The SPUR CPU Trap Types

In the table above, illegal opcode includes all the FPU opcodes whenever the FPU is disabled.
A LISP pointer type violation occurs when the tag fails the "CONS or NIL" test. A LISP data type
violation occurs when the tags fail the "Both operands are FIXNUM" or "Both operands are FIXNUM
or CHAR" test.

Side Effects:

k - changes to kernel mode        e - turns off ERROR detection

p - changes to physical mode      i - disables the Instruction Unit
Cache Controller Interface

This  module  communicates  with  the  Cache  Controller  and  is  out  of  the  scope  of  this 
chapter.  The  cache  controller  interface  is  described  in  [WEG87]. 

Trap  Logic 

This  module  detects  the  unusual  conditions  (Table  2-4-1),  prioritizes  them,  and  determines 
which  trap  type  to  take.  The  Trap  Logic  is  discussed  in  Section  2.4.1. 

Control  Unit 

This module controls the Execution Unit's Upper Datapath and Lower Datapath. The Control
Unit is discussed in Section 2.4.2.

2.4.1.  Trap  Logic 

The Trap Logic block must be able to detect thirteen different trap conditions (refer to
column "Description" of Table 2-4-1). Once these conditions are detected, they are grouped into
ten different trap types that are prioritized into eight priority levels. Each trap type has its own
4-bit trap number which is fed into TrapPC<7:4> (Figure 2-2-2) to form a unique trap vector. The
SPUR CPU can be programmed to selectively ignore most of these unusual conditions by writing
to the Upsw and the Kpsw (see Appendix A). In fact, whenever the CPU takes a trap, the
hardware disables any further traps by turning off the AllEn bit in the Kpsw.
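
The prioritizing and vector formation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the numeric entries of Table 2-4-1 are read as hexadecimal vectors with fixed upper bits of 0x1000, the name FPU_EX for trap number 5 is hypothetical (the table entry did not survive), and ties within one priority level are broken by trap number, which the text does not specify.

```python
# Trap types from Table 2-4-1 as (trap number, priority); a lower priority
# value wins. "FPU_EX" is a hypothetical name for trap number 5.
TRAP_TYPES = {
    "RESET":  (0, 0),
    "ERROR":  (1, 1),
    "WIN_OV": (2, 2),
    "WIN_UN": (3, 2),
    "FAU_IN": (4, 3),
    "FPU_EX": (5, 4),
    "RUN_ER": (6, 5),
    "TAG_TR": (7, 6),
    "IN_OV":  (8, 7),
    "CMP_TR": (9, 7),
}

TRAP_BASE = 0x1000  # assumed fixed upper bits of TrapPC


def trap_vector(number):
    """Place the 4-bit trap number in bits <7:4> of the vector."""
    return TRAP_BASE | ((number & 0xF) << 4)


def select_trap(pending):
    """Pick the trap type to take from a set of detected trap types."""
    # Ties within one priority level are broken by trap number (an assumption).
    name = min(pending, key=lambda t: (TRAP_TYPES[t][1], TRAP_TYPES[t][0]))
    return name, trap_vector(TRAP_TYPES[name][0])
```

For example, if a tag violation and a page fault are detected in the same cycle, select_trap({"TAG_TR", "FAU_IN"}) chooses FAU_IN because it has the higher priority.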

Figure 2-4-1 shows that these enabling, detection, prioritizing, and grouping functions are
implemented by five logic blocks separated by latches. Trap Enable and Trap Type are the only
logic blocks implemented by PLAs. Trap Enable updates the on/off status of the various traps
according to the contents of Kpsw and Upsw. Trap Type groups all the detected unusual conditions
into trap types and decides which trap type to take according to the priority shown in Table
2-4-1. The three Trap Request blocks are implemented in random logic and together they generate
the trap request (assert the signal trapRequest) whenever one or more unusual conditions are
detected. Pre-Trap Request does the initial set up. Trap Request (AND) and Trap Request (OR)
are analogous to the AND and OR planes of a PLA. The reasons why they are not combined into
Figure 2-4-1 Trap Logic Block Diagram

The pipeline diagram shows how the Trap Logic operates at different time points with respect to
any instruction that may trap. Each pipe stage (clock cycle) consists of four phases and the falling
clock edges of these phases are used by the latches to latch in the intermediate results. This is a
simplified view because in the real hardware, the clock phases are sometimes "ANDed" with
some other control signals before being used by these latches as trigger signals. Out of the five
combinational blocks, only the two shaded blocks are implemented by PLAs. The others are custom
logic due to timing or area constraints or both.


one logic block are the same as the reasons for not combining the Output and State logic blocks
in Figure 2-2-3: it makes the design easier to understand and reduces the input-output latency.

2.4.2.  Control  Unit 

The Execution Unit's Control Unit (Figure 2-4-2) is divided into two parts: the Master Control
and the Local Decoding Logic. Master Control decodes and buffers the opcode into high
level control signals. The Local Decoding Logic then decodes these high level control signals
into low level control signals that control the datapath. The coprocessor interface is part of the
Master Control but will not be discussed here. In a simplified view, the Master Control consists of
two parts:
Figure 2-4-2 The Control Unit Block Diagram


The  Control  Unit  can  be  divided  into  two  parts:  Master  Control  and  Local  Decoding  Logic.  In  the 
layout,  the  Master  Control  resides  in  the  center  of  the  chip  while  the  Local  Decoding  Logic 
blocks  are  scattered  along  the  Upper  Datapath  and  Lower  Datapath  close  to  where  the  low  level 
control  signals  are  needed. 


Opcode  PLA  and  Fast  Logic 

These  modules  decode  the  opcode  into  high  level  control  signals. 

Sequencer 

This module sequences the high level signals. Exec-Ctr-Buf, Mem-Ctr-Buf, and Wr-Ctr-Buf,
which contain latches and simple logic, together combine and buffer the high level signals
into three sets that are responsible for the control of the pipeline stages Exec, Mem,
and Wr, respectively.

There are only three high level signals for the Ifet stage of the pipeline. However, these
must be provided by fast logic because the opcode arrives at the Control Unit during φ3 of the Ifet
stage and these three high level control signals must be valid during the next clock phase (φ4) of
the same stage. The Exec stage is the busiest, which is reflected by the large number of high level
control signals (50) needed to control it. The Mem stage is a null stage except for memory access
instructions and requires only seven high level control signals. Finally, the Wr stage operation is
not that much different for different instructions and it requires only eight high level control signals.

The local decoding logic is organized into four blocks, each of which is specialized in controlling
one local area of the datapath. The four blocks, as shown earlier in Figure 2-4-2, are:

Register Control

This block controls the register file and temporary registers: Dst1, Dst2, and Mbr.

Functional Unit Control

This block controls the functional units: the Byte Extractor Inserter, the Shifter, and the ALU.


Figure 2-4-3 Local Decoding Logic


The Simple Combinational Logic blocks are located close to the datapath where the low level
control signals are needed. Each of these logic blocks generally consists of a single level of random
logic and also serves as a buffer between the high level and low level control signals.
Special Control

This  block  controls  the  special  registers:  Cwp,  Swp,  Ins,  Kpsw,  and  Upsw. 

PC  Control 

This  block  controls  the  program  counter  generation  logic:  ADDER,  INC,  and  the  various 
PCs. 

These four blocks are implemented in the generic structure shown in Figure 2-4-3 in which
the high level control signals are decoded by the Simple Combinational Logic into low level control
signals. There are never more than two levels of logic in the Simple Combinational Logic
and its outputs are then either used directly by the datapath or "ANDed" with one of the four
phases before they are used.
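
The generic structure can be illustrated with a small sketch. It is not taken from the SPUR design: every signal name below is hypothetical, and the choice of phase 3 for the latch enable is arbitrary. It only shows the shape described above, at most two levels of combinational logic whose outputs are either used directly or qualified by a clock phase.

```python
def local_decode(high, phase):
    """Decode high level control signals into low level ones.

    high  -- dict of boolean high level control signals (hypothetical names)
    phase -- current clock phase, 1 through 4
    """
    # First level of logic: combine high level signals.
    needs_adder = high["add"] or high["load"] or high["store"]
    # Second level of logic: qualify the result with another signal.
    alu_add = needs_adder and not high["kill"]
    return {
        "alu_add": alu_add,                              # used directly by the datapath
        "dst_latch": high["write_dst"] and phase == 3,   # "ANDed" with a clock phase
    }
```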

2.4.3.  Controller  Design  Insights 


Parts              Inputs   Outputs   Product   Logic   Implementation
                                      Terms     Gates   Effort (man-month)

Trap Logic:
  Trap Enable        23       12        14        -        0.25
  Trap Type          11        9        11        -        0.25
  [illegible]        14        6         -       24        0.50
  TrapReq. (AND)     24       16         -       18        0.50
  [illegible]        16       10         -       19        0.50

Control Unit:
  Opcode PLA          8       40        68        -        0.50
  [illegible]        18       14        16        -        0.50
  [illegible]        10        9         -       26        0.50
  [illegible]        19       13         -       26        0.50
  [illegible]        17       14         -       41        0.50
  Spec Ctr           27       19         -       23        0.50

Total                 -        -       109      177        5.00

Table 2-4-2 The Execution Unit Controller Design Metrics
The implementation metrics of the Trap Logic and the Control Unit are summarized in
Table 2-4-2. The implementation of the SPUR CPU Controller can be considered as an experiment
which shows that by using internal instructions (example: trap_call) and some satellite
logic blocks (example: Trap Logic), the main control engine that controls the datapath (example:
Control Unit) can be reduced to a simple N-Stage sequential logic structure (Figure 2-4-2) where
N is the pipeline length of the machine (example: N=4 for the SPUR CPU). The term "N-Stage
sequential" is used because the outputs depend on the inputs of the previous N cycles only. There
is no feedback in this structure and therefore it is not a state machine. This N-Stage sequential
logic block has well defined inputs (the instruction set and internal instructions, mainly the
opcode) and outputs (the datapath control signals).
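
The "N-Stage sequential" idea can be sketched as a feed-forward window over the last N opcodes. This is an illustrative model only: the class name NStageControl and the decode callback are hypothetical, and real control words would be bundles of signals rather than strings.

```python
from collections import deque


class NStageControl:
    """Feed-forward control: outputs depend only on the last n opcodes."""

    def __init__(self, decode, n=4):
        self.decode = decode  # opcode -> sequence of n per-stage control words
        self.window = deque([None] * n, maxlen=n)  # newest opcode first

    def clock(self, opcode):
        """Accept one opcode and return the control word for each stage."""
        self.window.appendleft(opcode)  # the oldest opcode falls off the end
        # Stage k is driven by the opcode that entered k cycles ago.
        return [self.decode(op)[k] if op is not None else None
                for k, op in enumerate(self.window)]
```

Because the window is the only storage and nothing computed from the outputs is fed back into it, an opcode stops influencing the outputs exactly n cycles after it enters, which is why the structure is not a state machine in the usual sense.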

While reducing control functions into internal instruction sequences and designing the satellite
logic blocks may still require some human ingenuity, CAD designers should be able to provide
CAD tools that can generate the N-Stage sequential logic automatically. Ideally, a VLSI
designer would like to have a set of CAD tools that can partition this N-Stage sequential logic
into Master Control and Local Decoding Logic, generate them automatically, and route the connections
between the two. An optimum solution is hard to define here but, as most VLSI
designers can tell you, the optimum solution is not necessary as long as the Master Control, the
Local Decoding Logic, and the routing between them meet the area, timing, and power constraints.

One final point is that reducing complex control functions by internal instructions gives
similar benefits to those found in microprogramming. However, in microprogramming, every
instruction (no matter how simple) is turned into a sequence of microinstructions. On the other
hand, in the internal instruction approach, only complex control functions are turned into
sequences of internal instructions. These internal instruction sequences can be quite short because
only the most critical steps need to be implemented. The rest of the steps can easily be coded as
software routines using the regular RISC-style instructions that are similar to traditional
microinstructions. In other words, unlike microprogramming, internal instructions allow a RISC-style
machine to handle complex situations without introducing an overhead on all other instructions.
Furthermore, internal instructions will not greatly increase the complexity of the instruction
decoding unit because they are similar (if not identical) to those that already exist in the RISC-style
instruction set. For example, in the SPUR CPU, the internal instruction trap_call is similar to the
regular instruction call and rd_pc is the same as the regular instruction rd_special ExecPC.
Chapter 3

THE  SPUR  CPU  EXPERIENCE 


The  biggest  performance  enhancement  is  achieved  when  going 
from  a  non-working  system  to  a  working  system. 

John  Ousterhout,  1988 


In this chapter, I will talk about the SPUR CPU experience. The lessons I learned from this
experience are the foundation of my view on the systematic approach to microarchitectural design
(Chapter 5) and the future trends (Chapter 6).

3.1.  From  Chip  to  System 

SPUR’s  mission  is  not  only  to  build  a  VLSI  chip  but  to  build  a  system  around  three  custom 
VLSI  chips.  As  far  as  the  CPU  is  concerned,  the  implications  of  this  ambitious  mission  are: 

•  The  CPU  specifications  do  not  come  from  expert  VLSI  designers  whose  goal  is  to  build  the 
fastest  and  the  most  innovative  CPU.  The  specifications  come  from  the  system  goals. 

•  We must increase our chance of having a working chip by using as much proven "technology"
as possible.

•  Instead of experimenting with architectural ideas, we must implement certain features that
[...] building block for the SPUR system.

The first implication is partly responsible for our relatively slow cycle time of 100ns. We
did not set the SPUR CPU cycle time goal any faster than 100ns because the SPUR memory system
[WEG87] and the SPUR bus [Gib87] cannot run any faster. In the following sections, I will
explain the other implications in more detail.

3.1.1.  The  Russian  Approach 

The phrase "use proven technology" means we tried to build the CPU based on previous
experience. I called this the "Russian approach" because the Soviet space program is a good
example of not feeling ashamed of using old but proven technology. Since our project goal is to
build a system based on the SPUR CPU chip, we decided to increase our chances of having a
working chip by using as many proven ideas as possible from the two previous generations of
Berkeley RISC processors, RISC I [Pat82] and RISC II [Kat83], and from SOAR [Ung84b]. As mentioned
in Chapter 2, the SPUR CPU differs in these five aspects:

Internal  Instruction  Cache 

The SPUR CPU has an on-chip 512-byte direct-mapped instruction cache organized into
sixteen blocks with eight instructions per block.

Figure 3-1-1 RISC II Pipeline vs. SPUR CPU Pipeline

Both pipelines consist of instruction fetch (Ifet), execution (Exec), and register write (Wr). The
SPUR CPU internal instruction cache allows Ifet in parallel with data access (Mem). Data
conflicts in both pipelines are resolved by internal forwarding and branch conflicts are resolved
by a single cycle delay branch. The SPUR CPU, however, requires two internal forwarding paths
(RISC II and SOAR only require one) due to the extra pipe stage.

Four-Stage  Pipeline 

RISC II and SOAR use the same three-stage pipeline that is shown in Figure 3-1-1. The
internal instruction cache essentially provides the SPUR CPU an extra port to memory and
enables us to add a memory access stage (Mem) to the RISC II pipeline. This results in the
SPUR CPU 4-stage pipeline that does not have to be suspended for LOAD.

Support  for  LISP 

The  SPUR  CPU  supports  LISP  by  three  types  of  hardware  tag  checking  [Tay86]:  data  type 
checking  for  general  operations,  pointer  type  checking  for  list  operations,  and  generation 
checking  for  garbage  collection  [Ung84a]  [ZHH88]. 

Cache  Controller  Interface 

In order to support multiprocessing, the SPUR CPU must communicate constantly with the
Cache Controller chip via a cache controller interface [WEG87].

Parallel  Coprocessor  (FPU)  Interface 

The SPUR CPU supports a coprocessor interface which allows the FPU to operate in
parallel with the CPU [HaK86].

Feature i                    Extra Pipeline   On Chip   LISP      Floating Pt.
                             Stage (Mem)      I-Cache   Support   Support

SPUR CPU vs.                 1.12             1.30      1.73      30.0
SPUR CPU - Feature i

Performance                  12%              30%       73%       2900%
Improvement (%)

Table 3-1-1 Contributions to Performance

The performance improvement due to each feature (Row 2) is estimated by comparing the performance
of the SPUR CPU against the performance of an imaginary, stripped down, SPUR CPU
(Row 1). The details of this analysis can be found in Chapter 4. All numbers are only approximations
because they are sensitive to the frequencies of different instructions and the quality of the
compiler.
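
The instruction cache geometry given earlier in this section (512 bytes, direct-mapped, sixteen blocks of eight instructions) implies four bytes per instruction. The sketch below shows the textbook way a direct-mapped cache of that shape splits a byte address. It is an illustration of the geometry only; whether the SPUR hardware uses exactly these fields is an assumption, and the function name split_address is hypothetical.

```python
BYTES_PER_INSTR = 4     # 512 bytes / (16 blocks * 8 instructions)
INSTRS_PER_BLOCK = 8
NUM_BLOCKS = 16


def split_address(addr):
    """Split a byte address into (tag, block index, instruction within block)."""
    word = addr // BYTES_PER_INSTR
    instr_in_block = word % INSTRS_PER_BLOCK
    block_index = (word // INSTRS_PER_BLOCK) % NUM_BLOCKS
    tag = word // (INSTRS_PER_BLOCK * NUM_BLOCKS)
    return tag, block_index, instr_in_block
```

Two addresses that differ by a multiple of 512 bytes map to the same block index and differ only in the tag, which is the defining property of a direct-mapped organization.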

These features' contributions to performance are evaluated in Chapter 4. The results are
summarized here in Table 3-1-1. The stripped down CPU for Column 1 uses the RISC II 3-stage
pipeline. The stripped down CPU for Column 2 does not have an on-chip instruction cache. Since
it is difficult to implement the SPUR 4-stage pipeline without the instruction cache, this stripped
down CPU also uses the RISC II 3-stage pipeline. The stripped down CPU for Column 3 does not
support hardware tag checking. Similarly, the stripped down CPU for Column 4 does not support
the FPU interface. For example, in Column 3, we estimate the SPUR CPU to run LISP programs
1.73 times as fast as (73% faster than) a similar CPU without hardware tag checking. The performance
improvement due to the multiprocessing support features is not included because we believe
multiprocessing performance depends more on the shared bus utilization and cache performance
[Kat85] [EgK88] than on the features in the CPU.
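
The two rows of Table 3-1-1 are related by simple arithmetic: a speedup ratio r corresponds to an improvement of (r - 1) x 100 percent. The following lines reproduce the second row from the first; the dictionary keys are shorthand labels introduced here for illustration.

```python
def improvement_percent(speedup):
    """Convert a speedup ratio into a percentage improvement."""
    return round((speedup - 1.0) * 100)


# Speedup ratios from Row 1 of Table 3-1-1 (keys are shorthand labels).
SPEEDUPS = {
    "extra_pipeline_stage": 1.12,
    "on_chip_icache": 1.30,
    "lisp_support": 1.73,
    "floating_point_support": 30.0,
}

IMPROVEMENTS = {name: improvement_percent(r) for name, r in SPEEDUPS.items()}
```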

3.1.2.  SPUR  CPU  System  Features 


Figure 3-1-2 Impact of the System Features - Graphical


This  illustrates  qualitatively  which  portion  of  which  module  is  affected  by  the  system  features. 
The SPUR CPU system features are essential to the SPUR system's functionality and performance.
The SPUR CPU system features come from three sources [Tay85]:

Multiprocessing  and  Cache  Consistency  Support 

This requires seven load instructions, three store instructions, and a cache controller interface.
Although all load or store instructions are alike internally, the CPU must request different
cache operations [KEW85] [WEG87] via the cache controller interface.

LISP  Support 

This requires four special load instructions, one read tag instruction, one write tag instruction,
and eight extra tag bits in the datapath. This tag architecture also adds six branch and
five trap conditions.


                      Multiprocessing   LISP            Floating Point   Total

Control PLA           6/54     11%      4/54     7%     3/54     6%      24%
Outputs

Control PLA           [...]             [...]           [...]            10%
Product Terms

Active Area           [...]             [...]           [...]            21%

Transistors           [...]             9.9/115  9%     0.4/115  0%      10%

Number of             15/156   10%      8/156    5%     37/156   24%     39%
Signal Pins

Table 3-1-2 Impact of the System Features - Quantitative

The  first  row  shows  that  the  master  control  PLA  has  54  outputs.  Six  of  these  54  outputs  (11%) 
are  used  to  control  the  multiprocessing  supporting  features,  four  outputs  (7%)  are  used  to  control 
the  LISP  supporting  features,  and  three  outputs  (6%)  are  used  to  control  the  FPU  supporting 
features.  The  total  is  that  24%  of  the  master  control  PLA  outputs  are  used  to  control  the  system 
features.  Similarly,  the  second,  third,  fourth,  and  fifth  row  show  that  the  system  features  are 
responsible  for  10%  of  the  master  control  PLA  product  terms,  consume  21%  of  the  total  active 
area,  10%  of  the  total  transistors,  and  39%  of  the  total  signal  pins. 
Floating Point Support

This  requires  eight  load  instructions,  four  store  instructions,  and  a  coprocessor  interface. 

The  coprocessor  FPU  also  adds  two  branch  conditions  and  one  trap  condition. 

The end results are 19 load instructions, seven store instructions, and a 40-bit non-standard
(not 32-bit) datapath. Furthermore, the CPU must be able to handle 20 branch conditions, nine
trap conditions, and support two non-trivial off-chip interfaces. The impact of these features on
resources is evaluated quantitatively in Chapter 4. The results are illustrated graphically in Figure
3-1-2 and summarized quantitatively in Table 3-1-2. The complexity of these features is not
additive; it is multiplicative! These features must be simulated at the behavioral level to ensure
they are implemented correctly. Since the complexity of these features is multiplicative, their
simulation effort is also multiplicative.
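
The load, store, and datapath totals quoted above follow directly from the per-feature counts listed for the three system features. The short tally below makes the arithmetic explicit. It assumes a 32-bit base word (suggested by the remark that the 40-bit datapath is "not 32-bit") and counts the LISP read tag and write tag instructions separately from loads and stores; the dictionary keys are labels introduced here.

```python
# Per-feature instruction counts: (load instructions, store instructions).
FEATURES = {
    "multiprocessing": (7, 3),
    "lisp": (4, 0),            # four special loads; tag read/write counted separately
    "floating_point": (8, 4),
}

WORD_BITS = 32  # assumed base word width
TAG_BITS = 8    # extra tag bits required by LISP support

total_loads = sum(loads for loads, _ in FEATURES.values())
total_stores = sum(stores for _, stores in FEATURES.values())
datapath_bits = WORD_BITS + TAG_BITS
```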

3.1.3.  Simulation  Strategy 
Figure 3-1-3 Behavioral Simulation Strategy

The behavioral model of the SPUR system was built from the bottom up. The lowest level modules,
for example the ALU, were first modeled and a set of test vectors was written to verify their
functionality. Once these lowest level modules were verified, they were grouped together to form
higher level composite modules following the hardware's hierarchical organization. This "verify
and merge" process was repeated until we had the composite module of the SPUR multiprocessor
workstation.
At the behavioral level, the complete SPUR system is described in the ISP' hardware
description language [KWG87]. As shown in Figure 3-1-3, behavioral simulation is divided into
two categories: chip simulation and system simulation. The SPUR CPU behavioral model, which
is discussed in more detail in Section 3.2, is not only used for chip level simulation but also used
as a building block for the processor board behavioral model for system level simulation. This is
necessary because not only do we need to verify each individual chip, we also need to verify the
interactions among the chips on the processor board and eventually the interactions among processor
boards in the multiprocessor. If our mission were just to build a CPU chip (the mission of
RISC I, RISC II, and SOAR), the chip simulation would have been sufficient. Since our mission is
to build a system, however, we must also complete the system simulation. After the behavioral
simulation is completed, the behavioral test vectors for the lowest level modules are converted to
switch level vectors to simulate the corresponding layout modules. The layout modules are then
merged to form the layout of the CPU, which is verified by the same diagnostic programs used for
behavioral simulation.

               General          Cache Controller   Tags &         FPU           Bootstrap     Total
               CPU              Interface          Traps          Interface     Programs

[illegible]    13,113 (24%)     13,875 (25%)       8,675 (16%)    1,543 (3%)    [illegible]   [illegible]

Man-Month      [illegible]      1.0 (29%)          0.5 (14%)      [illegible]   [illegible]   3.5 (100%)
of Effort

Table 3-1-3 Switch Level Simulation Summary

The first column verifies the basic function of the CPU. The next three columns verify the utility
features. Boot programs are simple bootstrap routines that are used to bootstrap the SPUR processor
board.

Table 3-1-3 summarizes the switch level simulation we performed. If we just wanted to
prove that we knew how to build a CPU chip, the first column is probably all the switch simulation
we would have needed. Columns 2 through 5, which constitute 76% of the cycles and 86%
of the effort, represent the extra simulation we have to do in order to build a chip to be used in a
system. This seems to indicate that it takes three to six times the effort to build a system like SPUR
compared with just a chip like RISC II and SOAR. Despite the large amount of effort we spent on
simulation, simulation is only the tip of an iceberg; the rest of the iceberg is the design process
discussed in the following section.
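
One way to read the "three to six times" estimate is as the ratio of the extra, system-related simulation to the chip-only baseline. Under that reading (an interpretation, not something the text states explicitly), the figure follows from the two percentages quoted above; the function name extra_over_base is hypothetical.

```python
def extra_over_base(extra_fraction):
    """Ratio of extra (system) work to baseline (chip-only) work."""
    return extra_fraction / (1.0 - extra_fraction)


cycles_ratio = extra_over_base(0.76)   # extra cycles relative to General CPU cycles
effort_ratio = extra_over_base(0.86)   # extra effort relative to General CPU effort
```

With 76% of the cycles the ratio is a little over three, and with 86% of the effort it is a little over six.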

3.2.  The  SPUR  CPU  Design  Process 
Figure 3-2-1 The SPUR CPU Design Process

In this figure, rectangular boxes represent steps in the design process while hexagonal boxes
represent products of the design steps. This is a simplified view because we do not show all the
interactions between different steps that make iterations necessary. The behavioral description
and the layout are the two most important products from the microarchitecture design and implementation
steps respectively.
A simplified view of the SPUR CPU design process is shown in Figure 3-2-1. The major
steps are: Specification, Macroarchitectural Design, Microarchitectural Design, and Implementation.

Specification 

We studied multiprocessor issues and the tradeoffs between different multiprocessor
configurations and selected the shared bus configuration (Figure 1-1-1(a)) to fulfill our initial
performance goal. Each SPUR processor node was then partitioned into a large cache
memory and three custom VLSI chips: the CPU, the CC, and the FPU (Figure 1-1-1(b)). This
Specification step created a textual specification of the requirements for the SPUR CPU
chip.

Macroarchitecture  Design 

We translated the textual specification into the instruction set, interface specifications, and
algorithms for the Cache Controller, the Floating Point Unit, and the CPU chips. In order to
specify each chip in more detail, we further divided each chip into modules which I call
macro-modules. The Instruction Unit and the Execution Unit are examples of macro-modules
in the SPUR CPU. As far as the CPU is concerned, this step created a machine
readable architectural description which enabled us to perform instruction level simulation
to evaluate the effectiveness and verify the correctness of this macroarchitecture.

Microarchitecture  Design 

The microarchitect studied the interactions among the macro-modules and described the
interactions in a behavioral description of the SPUR CPU. In describing the SPUR CPU
behavior, the microarchitect also expanded the macro-modules into smaller modules which
I call micro-modules and produced a block level design and a floor plan. The behavioral
description, which is the SPUR CPU behavioral model shown earlier in Figure 3-1-3,
models the microarchitecture and must be verified by behavioral level simulation.
Implementation

The behavioral description was translated into logic modules either automatically by CAD
tools (PLA) or by the logic designers (gates and latches). The circuit designer then implemented
these logic modules by transistors and wires that were eventually translated into
layout. The layout was then extracted by a circuit extractor to produce the switch level
description that can be verified by switch level simulation.

Strictly speaking, fabrication and testing are not part of the design process. They are
included in Figure 3-2-1 for the sake of completeness. Furthermore, in practice the SPUR CPU
design process was not a purely sequential process. A lot of work, especially in the
microarchitecture design and implementation steps, could be and was done in parallel. Consequently, the
total of nine man-years required by these four steps was accomplished in approximately four years
by five graduate students.† The initial SPUR study was done in the Fall of 1983 and the first version
of the SPUR CPU was fabricated in the Fall of 1987.

The macroarchitectural design and implementation steps of the SPUR CPU design process
are discussed in more detail in [Tay86] and [Lee86]. This thesis will focus on the
microarchitecture design step. The most important product of the microarchitectural design step is the
machine-readable behavioral description of the CPU. This behavioral description models the
microarchitecture and was shown earlier as the SPUR CPU Behavioral Model in Figure 3-1-3 in relation to
the behavioral model of the SPUR system. Section 3.2.1 discusses the construction of this
behavioral model. Section 3.2.2 discusses how this behavioral model can be used as a formal
specification for logic and circuit designers. Section 3.2.3 discusses how this behavioral model can
be used for layout verification. Finally, Section 3.2.4 summarizes some important observations
from the SPUR CPU design process.


† George Taylor is the macroarchitect, Shing Kong is the microarchitect, and Dave Lee is the chief
circuit designer. Wook Koh and Rich Duncombe are part-time logic and circuit designers and Mark Hill is
our macroarchitecture consultant.
3.2.1. The Construction of the SPUR CPU Behavioral Model

The SPUR CPU behavioral model [Kon89] was developed using the N.2 hardware modeling
package [EEE85]. In the N.2 environment, a piece of hardware can be modeled in two
different ways (see Figure 3-2-2):

(1)  As  a  primitive  module  that  is  described  in  ISP’,  or 


Figure  3-2-2  The  Structure  of  the  SPUR  CPU  Behavioral  Model 

The three shaded blocks (i_unit, reg_file, and master_ctr) are composite modules. All other blocks
are primitive modules. A primitive module is a hardware description written in ISP’. A composite
module is a collection of primitive modules connected together by a topology file. The CPU
behavioral model [Kon89] is by definition a high level composite module which consists of both
composite and primitive modules.
(2) As a composite module that consists of two or more primitive modules connected together
by a topology file.

The  ISP’  hardware  description  language  provides  the  designer  a  way  to  model  the  behavior  while 
the  topology  file  facility  provides  the  designer  a  way  to  model  the  structure. 

The SPUR CPU behavioral model was built using the "meet-at-the-middle" approach. The
desired behavior of the microarchitecture was first determined informally and then it was decided
how this behavior could be implemented structurally. The next step was to divide the conceived
structure hierarchically into modules. Once the modules were defined, the SPUR CPU behavioral
model was built from the bottom up. The lowest level modules were described in the ISP’ hardware
description language as primitive modules and a set of test vectors was written to test each
module’s functionality. Once these lowest level modules were tested, they were grouped together
to form higher level composite modules following the hardware’s hierarchical organization. This
test and merge process was repeated until the CPU composite module was built. Consequently,
this CPU composite module actually models both the behavioral and structural characteristics of
the SPUR CPU microarchitecture, although it is only called the CPU behavioral model.
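
The test and merge process can be illustrated with a minimal Python sketch. It is not N.2 or ISP’ code: the classes Primitive and Composite, the helper merge_tested, and the adder/shifter example are all hypothetical names chosen for illustration, and the ordered child list only stands in for what a topology file would express.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Signals = Dict[str, int]
Vector = Tuple[Signals, Signals]  # (input signals, expected output signals)


@dataclass
class Primitive:
    """Hypothetical stand-in for a primitive module: one behavior function."""
    name: str
    behavior: Callable[[Signals], Signals]

    def evaluate(self, signals: Signals) -> Signals:
        return self.behavior(signals)


@dataclass
class Composite:
    """Hypothetical stand-in for a composite module; the child order plays
    the role that a topology file would play."""
    name: str
    children: List[object]

    def evaluate(self, signals: Signals) -> Signals:
        state = dict(signals)
        for child in self.children:
            state.update(child.evaluate(state))
        return state


def passes(module, vectors: List[Vector]) -> bool:
    """True if the module produces every expected output of every vector."""
    for inputs, expected in vectors:
        outputs = module.evaluate(inputs)
        if any(outputs.get(k) != v for k, v in expected.items()):
            return False
    return True


def merge_tested(name, tested_children, composite_vectors) -> Composite:
    """Test each child with its own vectors, merge, then test the result."""
    for child, vectors in tested_children:
        if not passes(child, vectors):
            raise ValueError(f"module {child.name} failed its test vectors")
    merged = Composite(name, [child for child, _ in tested_children])
    if not passes(merged, composite_vectors):
        raise ValueError(f"composite {name} failed its test vectors")
    return merged


# Illustrative two-module example (not taken from the SPUR design).
adder = Primitive("adder", lambda s: {"sum": (s["a"] + s["b"]) & 0xFFFFFFFF})
shifter = Primitive("shifter", lambda s: {"out": (s["sum"] << 1) & 0xFFFFFFFF})
datapath = merge_tested(
    "datapath",
    [(adder, [({"a": 1, "b": 2}, {"sum": 3})]),
     (shifter, [({"sum": 3}, {"out": 6})])],
    [({"a": 1, "b": 2}, {"out": 6})],
)
```

Because merge_tested returns another module with the same evaluate interface, its result can itself be passed as a child to a later merge, which mirrors how the process repeats up the hierarchy.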

The structure of the SPUR CPU behavioral model is shown in Figure 3-2-2. The only
composite modules are reg_file, master_ctr, and i_unit, which model the register file and temporary
registers, the master control, and the instruction unit, respectively. Most of the modules in the
SPUR CPU behavioral model are primitive modules because we tried to ease the transformation from
behavioral model to silicon by keeping a one-to-one correspondence between the behavioral
modules and the prospective layout modules. Consequently, most of the behavioral modules are
simple components (ALU, SHIFTER, and so on) whose behavior is well understood and
can be described easily in a single ISP’ primitive module.
3.2.2. Behavioral Model: Formal Specification for Logic and Circuit Designers

The behavioral model of the SPUR CPU was not only used for CPU chip and SPUR system
verification (Section 3.1.3), but was also used as a "formal specification" for logic and circuit
designers. This is illustrated by the example in Figure 3-2-3. The microarchitect prepared a block
diagram that showed the input/output interfaces and the logic at register transfer level for each
module in the behavioral model. Furthermore, as mentioned earlier in Section 3.2.1, a set of test
vectors was created for each module to exercise its functionality during the construction of the
if (selectBusA eq 1)

Figure 3-2-3 Formal Specification for Logic and Circuit Design Example

In this simple example, there are two 32-bit busses (busA and busB), a two-by-one multiplexer
(MUX), and a 32-bit register (REG1). The signal phi1 is a clock signal and selectBusA is a
control signal. Register REG1 will latch in the value on either busA or busB during every phi1. This
behavior is described textually by the behavioral model, illustrated graphically by the block
diagram, and verified by the test vectors.
behavioral model. The behavioral model of a module, the block diagram, and the test vectors
together form the specification of the module to be implemented by the logic and circuit
designers.
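
The example of Figure 3-2-3 can be approximated by a short Python sketch that pairs a behavioral description with its test vectors. This is not the ISP’ description from the figure. It assumes that selectBusA equal to 1 selects busA (and anything else selects busB), and that REG1 is transparent while phi1 is 1 and holds its value otherwise; the class name MuxRegister and the vector layout are hypothetical.

```python
MASK32 = 0xFFFFFFFF


class MuxRegister:
    """Two-by-one MUX feeding a 32-bit register, as described for Figure 3-2-3."""

    def __init__(self) -> None:
        self.reg1 = 0

    def step(self, phi1: int, select_bus_a: int, bus_a: int, bus_b: int) -> int:
        # Assumption: REG1 latches the MUX output only while phi1 is high.
        if phi1 == 1:
            mux_out = bus_a if select_bus_a == 1 else bus_b
            self.reg1 = mux_out & MASK32
        return self.reg1


# Each vector: (phi1, selectBusA, busA, busB, expected REG1 after the step).
TEST_VECTORS = [
    (1, 1, 0x11111111, 0x22222222, 0x11111111),  # latch busA
    (0, 0, 0x33333333, 0x44444444, 0x11111111),  # phi1 low: hold old value
    (1, 0, 0x33333333, 0x44444444, 0x44444444),  # latch busB
]


def run_vectors(vectors):
    """Apply the vectors in order and return the indices of any mismatches."""
    module = MuxRegister()
    failures = []
    for index, (phi1, sel, bus_a, bus_b, expected) in enumerate(vectors):
        if module.step(phi1, sel, bus_a, bus_b) != expected:
            failures.append(index)
    return failures
```

An empty list from run_vectors(TEST_VECTORS) means the behavioral description and the vectors agree, which is the property the logic and circuit designers then have to preserve in their implementation.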

Ideally,  we  would  like  to  have  CAD  tools  to  generate  the  block  diagram  automatically  from 
the  behavioral  model,  or  vice  versa.  We  would  also  like  to  have  another  tool  to  generate  the  test 
vectors  automatically  from  either  the  behavioral  model  or  the  block  diagram.  Finally,  we  would 
like  to  have  some  module  generators  to  generate  the  layout  for  us  automatically  from  this  formal 
specification.  Alas,  such  ideal  CAD  tools  were  not  available  for  SPUR.  The  block  diagram,  the 
test  vectors,  and  most  of  the  logic  design,  circuit  design,  and  layout  had  to  be  done  by  hand.  The 
only layout that could be generated automatically from the behavioral description was that of simple
control blocks from the PLA generators.

3.2.3. Behavioral Model: An Aid for Switch Level Simulation

The layout created by hand must be verified by switch level simulation to ensure it is
functionally correct. The verification process shown in Figure 3-2-4 ensures the layout behaves the
same as specified in the behavioral model. As discussed earlier, there is a one-to-one
correspondence between the behavioral module and the layout module and every behavioral module has its
own set of test vectors. By running this set of test vectors through the behavioral simulator, we
can trace the results, and do a simple format conversion to obtain the switch level test vectors for
the layout of that module. After the layouts of all the modules were tested individually, they were
merged to form the CPU chip.

The behavioral model of the CPU is not verified by test vectors. It is verified by diagnostic
programs written in SPUR assembly language. Since the N.2 behavioral simulator supports
simulated memories, all we had to do was to generate a memory image using the SPUR
assembler and linker, load this image into the simulated memory, and start the execution. This
simulated execution was traced and the trace information was then converted to switch level test
vectors for global switch level simulation of the CPU chip.
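
The format conversion from a behavioral trace to switch level test vectors can be sketched as follows. Both formats here are hypothetical: the trace is assumed to be a list of per-step dictionaries mapping signal names to integers, and the "set ... ; expect ..." output line is an invented text format, not the input syntax of any particular switch level simulator.

```python
def to_bits(value: int, width: int) -> str:
    """Render an integer as a fixed-width binary string."""
    return format(value & ((1 << width) - 1), f"0{width}b")


def trace_to_vectors(trace, inputs, outputs):
    """Turn a behavioral trace into one text line per simulation step.

    trace   -- list of dicts, one per step, mapping signal name -> int
    inputs  -- dict of input signal name -> bit width (driven values)
    outputs -- dict of output signal name -> bit width (checked values)
    """
    lines = []
    for step in trace:
        drive = " ".join(f"{n}={to_bits(step[n], w)}" for n, w in inputs.items())
        check = " ".join(f"{n}={to_bits(step[n], w)}" for n, w in outputs.items())
        lines.append(f"set {drive} ; expect {check}")
    return lines


# Hypothetical two-step trace of a small module.
example_trace = [
    {"sel": 1, "bus": 5, "reg": 5},
    {"sel": 0, "bus": 9, "reg": 5},
]
example_vectors = trace_to_vectors(example_trace, {"sel": 1, "bus": 4}, {"reg": 4})
```

The point of the sketch is that the expected outputs come from the behavioral simulation itself, so the designer does not have to write the switch level expectations by hand.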
Figure 3-2-4 The Verification Process

The behavioral description on the right can be the description of a primitive module, or a
composite module, or even the description of the CPU. The behavioral description is tested by the
behavioral  simulator.  Behavioral  simulation  guidance  can  be  provided  in  two  different  ways. 
Source  1:  Test  vectors  are  used  to  test  the  primitive  and  composite  modules’  functionality. 
Source  2:  Diagnostic  programs  that  are  assembled  into  a  memory  image  are  used  to  test  the 
description  of  the  complete  CPU. 


3.2.4. Important Observations

Three  important  observations  can  be  derived  from  the  SPUR  CPU  design  experience.  These 
three  observations  will  become  important  considerations  as  I  try  to  develop  a  more  analytical 
approach  to  microarchitectural  design  in  Chapter  5  and  predict  the  future  trends  in  Chapter  6.  The 
three  observations  are: 

(1) A CAD tool that can transform the behavioral model directly to the layout (a silicon
compiler) will be extremely useful, but still will not solve all the problems. As illustrated
in the SPUR design process (Figure 3-2-1), such a CAD tool will only simplify the
Figure 3-2-5 Conflicting Requirements of the Behavioral Model


The behavioral model can be a good tool to evaluate the effectiveness of alternative
microarchitectures if it can be built rapidly. This requires the description to be more abstract (higher level).
On the other hand, if one wants to be able to transform the behavioral description into layout
easily or even automatically, it has to be less abstract (lower level). Furthermore, if one wants to use
the behavioral verification results to drive the switch level simulation effectively, the one-to-one
mapping between the behavioral description and the layout must be carried down to a relatively low
level.


implementation step. The microarchitectural design step is still a major task by itself.
You may argue that if the same microarchitecture is to be implemented in different
technologies as different products, then the microarchitectural development cost can be divided
among the different products. Practical experience has shown, however, that to get the highest
performance from a technology, the microarchitecture has to be customized to that
technology [Pat89]. Finally, if the microarchitectural design is poor, it will be impossible for
any silicon compiler to generate a good implementation from it.

(2) Verification is time consuming. It requires a lot of human interaction time because: (a)
the designer has to create all the test cases either directly in the form of test vectors or
indirectly via diagnostic programs, and (b) the designer has to interpret the verification
results. To make matters worse, verification is usually done repeatedly at different levels.
For example, in the SPUR CPU design process (Figure 3-2-1), verification is done after
each major step by Instruction Level Simulation, Behavioral Level Simulation, and
Switch Level Simulation. In order to speed up the design process, the human interaction
time needed for verification must be reduced. The options are: (a) reduce the time it takes
to generate the test cases, or (b) reduce the redundant verifications among different levels.

(3) There are two conflicting requirements for the description that models the
microarchitecture (Figure 3-2-5). The SPUR CPU behavioral model [Kon89] is more towards the low
level for two reasons. First of all, we wanted to use the verification results of the behavioral
model to drive our switch level simulation. Secondly, we started writing the behavioral
model late: we had already committed to most of the microarchitectural features.
Consequently, instead of being a microarchitecture test bed, the behavioral model was used
mainly as a specification for logic and circuit designers. Ideally, we would like to model
as many alternative microarchitectures as possible such that we can evaluate the
effectiveness of each alternative quantitatively. Due to the time we spent in modeling the
SPUR CPU at low level, we could not afford major alterations in the SPUR CPU
microarchitecture by the time we completed the first SPUR CPU model. In the future, I think
VLSI designers should work on a high-level behavioral model earlier as a test bed for
microarchitectural ideas. Once the high-level description is completed (hopefully with
extensive CAD tool support), the designer can transform it into a low-level description
for layout generation and switch level simulation.

3.3.  The  SPUR  CPU  Problems 

All known SPUR CPU problems and their solutions are listed in Appendix B. This section
discusses the more "educational" problems, those that taught us some valuable lessons. The
CPU problems can be classified into three groups:

(1)  Microarchitectural  Problems 

The CPU chip is doing exactly what the microarchitect designed it to do although it is not
doing what the microarchitect wanted it to do. The microarchitect has designed it wrong!
These problems can be simulated in behavioral and switch level simulation. They were not
detected during simulation because we did not cover all possible cases or we did not realize
they were problems.

(2)  Electrical  Problems 

The CPU chip is not doing what either the microarchitect or the logic designer designed it to do
due to unexpected electrical problems. These problems cannot be simulated in either behavioral
or switch level simulation. Careful and in-depth circuit simulation is the only way to
detect these problems. These problems exist because the switch level simulation is not low
level enough and it is not practical to run circuit simulation for the entire chip.

(3)  Implementation  Problems 

The CPU chip is doing exactly what the logic or circuit designer designed it to do although
it is not doing what the microarchitect wanted it to do. The logic or circuit designer
implemented something differently from what the microarchitect had in mind! These problems
may be detected by comparing the switch level simulation results against behavioral level
simulation results if both the switch level and behavioral level descriptions have the proper
level of detail. These problems exist because of miscommunication between the
microarchitect and the logic or circuit designer.

3.3.1.  Microarchitectural  Problems 

The most educational microarchitectural problem for SPUR is in the design of special
registers. The SPUR CPU special registers that have potential problems are:


Cwp  Current  register  window  pointer. 
Swp  Save  register  window  pointer. 

Kpsw  Kernel  processor  status  word. 
(a) Structure of Current Window Pointer

(b) Timing in the Pipeline (assume the operation that changes Cwp from OLD to NEW is "killed" by a trap)

Figure 3-3-1 Structure and Timing of the Special Registers

The structure of Cwp is shown in (a). All special registers are similar in that they all consist of
two parts: Current and Backup. Any instruction that modifies the special register changes the
Current part during φ4 of its Exec stage (for Cwp, it can also be changed during φ2 instead
of φ4) and updates the Backup during φ3 of its Wr stage. (b) shows how the SPUR CPU can
recover the special register to its old value during φ4 of the Mem stage if the instruction is killed
by a trap.


Upsw  User  processor  status  word. 

Ins  Insert  byte  count  register. 

These special registers are described in detail in Appendix A. Figure 3-3-1 shows the
structure and the timing of the special register Cwp. Cwp is the most complex special register because
it can be loaded from four different sources (see Figure 3-3-1(a)):

(1) Load from busS during φ4 for WR_SPECIAL instruction.

(2) Load from its Backup copy during φ4 when there is a trap.

(3) Load from its plus-one copy during φ2 for CALL instruction.

(4) Load from its minus-one copy during φ2 for RETURN instruction.

All other special registers have structure and timing similar to the Cwp but they can only be
loaded during φ4 from two sources: busS or the Backup copy. I have simplified Figure 3-3-1(a)
by showing all storage nodes as dynamic latches (a simple pass transistor followed by a buffer). In
the SPUR CPU, the Current and Backup have to be pseudo-static registers. Furthermore, all pass
gates in the SPUR CPU are composite pass gates formed by connecting NMOS and PMOS
transistors in parallel.

The philosophy behind the special register design is that the Current part is changed as soon
as possible such that the next instruction can use the new value. The Backup part is needed to
recover the old value if the instruction that changes the special register is "killed" by a trap. This
is illustrated in Figure 3-3-1(b), which shows that even if the instruction is "killed" at the last
possible time (during φ2 of its Mem stage; see Figure 2-3-4), the SPUR CPU can still recover the
special register to its old value during φ4 of the Mem stage.
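
The intended Current/Backup behavior for a single instruction can be summarized in a small Python sketch. It is a simplification under stated assumptions: pipeline phases are collapsed into method calls, overlap between consecutive instructions is ignored, and the names SpecialRegister, exec_write, wr_commit, trap_recover, and run_instruction are hypothetical.

```python
class SpecialRegister:
    """Simplified model of a special register with Current and Backup parts."""

    def __init__(self, value: int) -> None:
        self.current = value
        self.backup = value

    def exec_write(self, new_value: int) -> None:
        # Exec stage: Current changes early so the next instruction sees it.
        self.current = new_value

    def wr_commit(self) -> None:
        # Wr stage: Backup catches up with Current.
        self.backup = self.current

    def trap_recover(self) -> None:
        # Mem stage, after a trap: Current is restored from Backup.
        self.current = self.backup


def run_instruction(reg: SpecialRegister, new_value: int, killed_by_trap: bool):
    """Run one register-modifying instruction and return (Current, Backup)."""
    reg.exec_write(new_value)
    if killed_by_trap:
        reg.trap_recover()
    else:
        reg.wr_commit()
    return reg.current, reg.backup
```

For a register that starts at OLD, a killed instruction leaves both parts at OLD, while a completed instruction leaves both at NEW. The sketch deliberately says nothing about what happens when two such instructions overlap in the pipeline.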

The first mistake I made in designing the special register can be traced to Figure 3-3-1(b). In
the regular register case, if any instruction that modifies regular registers is killed by a trap, its Wr
stage is disabled and its destination register is not modified. In the special register case, if any
instruction that modifies any special register is killed by a trap, Figure 3-3-1(b) seems to indicate


(a) Potential Problem: 3: Cwp=Q, 4: Temp=Q, 6: Backup=Q

(b) SPUR CPU Problem (assume originally Cwp=N): Cwp=N+1, Temp=N+1, Backup=N+1

Figure 3-3-2 Problems with the Special Registers

(a) shows the potential problem with the special registers when two consecutive instructions try
to modify the same special register. Since the second instruction (I2) changes the Temp during
φ1 of its Mem stage (Step 4) before the first instruction (I1) updates the Backup (Step 5), Backup
will get the latest value from Temp one cycle too early (in Step 5 instead of Step 6). (b) shows
how this potential problem turns into a real problem in the SPUR CPU when a Call or Return
instruction is killed by a trap during its Exec stage. Since the internal instruction Trap_call uses the
Backup copy of the Cwp to decide where to save the return address during φ2 of its Wr stage,
the return address is saved in the new window (N+1) instead of the desired old window (N).
that writing the Backup copy does not change anything: the Backup remains equal to OLD. To
summarize this error:

Mistake  1 

Instead of treating special registers the same way as I treated regular registers, I did not
disable the Wr stage (did not set the Backup write-enable signal in Figure 3-3-1(a) to 0) of instructions
that modify special registers even if they are killed by a trap. I had too much confidence in Figure 3-3-1(b).

The structure shown in Figure 3-3-1(a) has another potential problem. It will not allow two
consecutive instructions to modify the same special register. This is illustrated in Figure 3-3-2(a)
where the first instruction wants to set the Cwp to P while the second instruction wants to set the
Cwp to Q. Due to the timing and the limitation of the structure, the Backup copy has the wrong
value (Q instead of the correct value P) during the time period T_critical:

φ3 of I1’s Wr stage < T_critical < φ3 of I2’s Wr stage

If the second instruction (I2) needs to use the Backup during T_critical, it will get the wrong value. I
discovered this problem very early in the design process. I also noticed that this problem can be
fixed easily by adding one more temporary latch between the Current and its Backup.
Unfortunately, this was not done because I made my second and third mistakes:

Mistake  2 

I thought the only time the second instruction would use the Backup was when it is trapped in its
Mem stage as shown in Figure 3-3-1(b).

Mistake  3 

I thought nobody in his right mind would try to change the same register in two consecutive
instructions as in Figure 3-3-2(a) because the first instruction can be replaced by a NOOP.

These two mistakes led me to my fourth mistake:

Mistake  4 

Instead of fixing the problem in hardware, I established a software restriction forbidding
any instruction sequence in which consecutive instructions modify the same special register.
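
The potential problem of Figure 3-3-2(a) can be reproduced in a phase-level Python sketch. The timing is an interpretation of the text rather than the actual circuit: Temp is assumed to copy Current on every φ1, Backup is written from Temp during φ3 of each instruction's Wr stage, I1 executes in cycle 0 and I2 in cycle 1. The extra latch (temp2, loaded from Temp on every φ4) is one hypothetical way to realize the "one more temporary latch" fix; the source does not say how the latch would have been clocked.

```python
def backup_history(p, q, old, extra_latch=False):
    """Return the Backup value written at the Wr stage of I1 and of I2.

    I1 sets the register to p, the next instruction I2 sets it to q.
    Pipeline slots: I1 is in Exec/Mem/Wr during cycles 0/1/2 and
    I2 is in Exec/Mem/Wr during cycles 1/2/3.
    """
    current = temp = temp2 = backup = old
    history = {}
    for cycle in range(4):
        for phase in (1, 2, 3, 4):
            if phase == 1:
                temp = current              # Temp follows Current (assumed)
            if phase == 3 and cycle in (2, 3):
                # Wr stage of I1 (cycle 2) or I2 (cycle 3) updates Backup.
                backup = temp2 if extra_latch else temp
                history["I1_wr" if cycle == 2 else "I2_wr"] = backup
            if phase == 4:
                temp2 = temp                # hypothetical extra latch
                if cycle == 0:
                    current = p             # Exec stage of I1
                elif cycle == 1:
                    current = q             # Exec stage of I2
    return history


without_fix = backup_history("P", "Q", "OLD")
with_fix = backup_history("P", "Q", "OLD", extra_latch=True)
```

Under these assumptions, without_fix records Q at I1's Wr stage (one cycle too early), while with_fix records P there and Q only at I2's Wr stage.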

Needless to say, I was very surprised when the potential problem shown in Figure 3-3-2(a)
turned into a real problem in the SPUR CPU shown in Figure 3-3-2(b). The two surprises were:

Surprise  1 

The discovery of Mistake 3. Two consecutive instructions modifying the same register can
happen implicitly when a Call or Return is killed by a trap during its Exec stage. This is
shown in Figure 3-3-2(b). The internal instruction Trap_call, which is placed in the
pipeline by the trapRequest signal (see Figure 2-3-4), modifies the special register Cwp the
same way as the regular Call.

Surprise  2 

The discovery of Mistake 2. The second instruction (in the case of Figure 3-3-2(b), the
internal instruction Trap_call) will use the Backup during T_critical even if it is not trapped.
Similar to the regular Call, the Trap_call uses the Backup copy to decide in which register window
to save the return address.

Despite these two surprises, the potential problem shown in Figure 3-3-2(a) still would not
have turned into a real problem in the SPUR CPU shown in Figure 3-3-2(b) if I had not made
Mistake 1. Had I not made Mistake 1, the Wr stage of Call or Return in Figure 3-3-2(b) would
have been disabled by the trapRequest and the Backup would not have been clobbered. Tom
Wolfe, in his book "The Right Stuff", observed that many military pilots believed a pilot was
never killed by a single mistake. Well, in this case, I sure made enough mistakes to get the CPU
into serious trouble!

Fortunately, the case shown in Figure 3-3-2(a) is a very unusual case and it will never
happen when the on-chip instruction cache is disabled. Unfortunately, it is so unusual that it was
never tested in behavioral simulation and we did not detect this error until we had brought up the
operating system and decided to turn on the on-chip instruction cache to increase the speed. As a
matter of fact, this is the only reason why we had a problem turning on the on-chip instruction
cache. The rumor about the on-chip instruction cache’s problem has been greatly exaggerated!

Notice that the problem shown in Figure 3-3-2(b) is only a transient error because the
Backup Cwp simply has the "correct" value at the wrong time (one cycle too early). Once the
Trap_call finishes its execution, it becomes perfectly legal for the Backup to have this latest
value. However, this is bad enough to cause the Trap_call to save the return address in R26 of
the new register window instead of R26 of the old register window. This problem does have a
simple software solution. If, by software convention, all procedures and trap handlers must set their
R26 to zero before returning to their return address, then the SPUR CPU can check R26 of the new
window whenever it takes a trap. If it is not zero, the case shown in Figure 3-3-2(b) must have
occurred. The software can then find out what the return address is by reading and saving this
register.
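
The software check described above can be sketched in Python. This is only an illustration of the convention, not SPUR trap handler code: register windows are modeled as independent dictionaries (ignoring any overlap between adjacent windows), and the function name find_trap_return_address and the example addresses are hypothetical.

```python
def find_trap_return_address(windows, old_window, new_window):
    """Locate the return address saved by Trap_call.

    windows -- dict: window number -> dict of register name -> value
    By the assumed software convention, R26 of the new window is zero
    unless Trap_call wrongly saved the return address there.
    """
    misplaced = windows[new_window]["R26"]
    if misplaced != 0:
        # The case of Figure 3-3-2(b): the address landed in the new window.
        return misplaced
    # Normal case: the address is in R26 of the old window.
    return windows[old_window]["R26"]


# Hypothetical register contents for the two cases (N = 4, N+1 = 5).
normal_case = {4: {"R26": 0x1000}, 5: {"R26": 0}}
faulty_case = {4: {"R26": 0}, 5: {"R26": 0x2000}}
```

The check only works if every procedure and trap handler really does clear its R26 before returning, which is why the text presents it as a software convention rather than a hardware fix.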

One big lesson we learned here is: Keep it regular! Whenever you make an exception
(Mistakes 1 and 4), there are likely to be some unexpected cases that get you in the most unexpected way.
Another lesson is that there may be many cases you never thought of during simulation
(Mistakes 2 and 3) and one must find some easy way to cover more cases in simulation. This is
discussed further in Section 3.4.1.

3.3.2.  Electrical  Problems 

The microarchitect is not the only person in SPUR who made multiple mistakes. The
circuit designer also made multiple mistakes at the electrical level that resulted in an electrical
problem in the SPUR CPU. This is discussed in Section 3.3.2.1. After discussing all these problems
caused by multiple mistakes, Section 3.3.2.2 shows how one single mistake can ruin your whole
day!

3.3.2.1. A Hazardous Circuit

The  circuit  shown  in  Figure  3-3-3(a)  is  hazardous  because  the  clock  signal  clock  is  gated 
with  other  inputs  in  a  way  that  it  is  forced  to  pass  through  two  different  paths  before  it  is  merged 
Figure 3-3-3 A Hazardous Circuit

The case we are interested in is when Input = 5V and clock goes from 0V to 5V. When clock
equals 0V, node ExecRd is charged to 5V. When clock switches from 0V to 5V, we want
ExecRd to stay at 5V. The only hazard that may discharge ExecRd is that there may be glitches on
the ldRd_L and ldRd signals. Our SPICE circuit simulation showed this hazard can occur only if
C2 > 3×C1. However, as shown in (b), even if there are glitches on ldRd_L and ldRd, ExecRd is
still not destroyed.


again into another signal (ldRd_L). This hazard, however, cannot be detected by a switch level
simulator that does not have a good capacitance model. When the effects of the parasitic
capacitors C1 and C2 are ignored, the bottom path in Figure 3-3-3(a) has one less gate delay and will go
to 0V before the top path goes to 5V. There will not be any glitch on the control lines ldRd_L and
ldRd.

Mistake  1 

We believed our switch level simulation and kept a hazardous circuit in our design that
combines the clock signal with other signals at a place other than at the control point.

This hazard can be detected by careful circuit level simulation using SPICE. If C2 > 3×C1,
the top path in Figure 3-3-3(a) will go to 5V before the bottom path goes to 0V. There will be
glitches on the control lines ldRd_L and ldRd. However, if this is the only problem, Figure 3-3-3(b)
shows that ExecRd still will not be discharged unintentionally because the glitches on the
control lines ldRd_L and ldRd are not big enough.

Figure 3-3-4 Problems of the Hazardous Circuit

The SPUR CPU clock line is modeled as an LC network in (a). This simulation indicates that
there will be some ringing in the clock signal clock. (b) shows how the glitches on ldRd_L and
ldRd, which were insignificant in Figure 3-3-3(b), are now amplified by the glitches in the clock
signal. These larger glitches are big enough to discharge ExecRd accidentally.

But then again, as the great philosopher Murphy had predicted, things usually get worse before
getting any better. We made our second mistake:

Mistake  2 

Instead  of  placing  the  clock  generator  and  clock  line  drivers  in  the  middle  of  the  chip,  they 
were  placed  on  the  left  hand  side.  This  resulted  in  long  and  unbuffered  clock  wires. 

In the first version of the CPU, the clock line is approximately 8mm long. We estimated it
to have one Ohm of resistance, 10nH of inductance, and 14.6pF of capacitance. Although the
resistance is relatively small, the inductance and capacitance are big enough to cause some
ringing in the clock line (Figure 3-3-4(a)). While the ringing in the real clock line will die down
due to resistance, Figure 3-3-4(b) shows that the initial ringing on the clock line clock is enough
to amplify the glitches on the control lines ldRd_L and ldRd such that ExecRd will be discharged
unintentionally.
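
As a rough sanity check on these estimates, the sketch below treats the clock line as a single lumped series RLC section and computes its natural ringing frequency and damping ratio. The lumped model is an assumption made here for illustration (the LC network of Figure 3-3-4(a) may be structured differently), and the function name is hypothetical.

```python
import math

def series_rlc_ringing(r_ohm, l_henry, c_farad):
    """Return (natural frequency in Hz, damping ratio) of a lumped series RLC."""
    omega_0 = 1.0 / math.sqrt(l_henry * c_farad)          # undamped natural frequency, rad/s
    f_0 = omega_0 / (2.0 * math.pi)
    zeta = (r_ohm / 2.0) * math.sqrt(c_farad / l_henry)   # damping ratio of a series RLC
    return f_0, zeta

# Estimates quoted in the text for the first-version clock line.
f_0, zeta = series_rlc_ringing(r_ohm=1.0, l_henry=10e-9, c_farad=14.6e-12)
print(f"natural frequency ~ {f_0 / 1e6:.0f} MHz, damping ratio ~ {zeta:.3f}")
```

Under this simplified model the damping ratio is far below 1, so the line is strongly underdamped, which is consistent with the ringing described above.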

Figure 3-3-5 Misplaced Well and Substrate Contacts

Instead of being placed on the right side of the transistor and connected to the power supply and
GND respectively (the dotted line), the well and substrate contacts are placed incorrectly on
the left side and connected to the busses. These misplaced contacts form diodes between the
busses and the power supply and GND that will prevent busS in (a) from going below 4.3V and
busD in (b) from going above 0.7V.


This problem made the Call instruction unable to save the return address. It was first
solved in software by emulating the Call instruction. The hardware was fixed in the second version
of the CPU chip, where the hazardous circuit was redesigned and the clock generator was moved to
the center of the chip to reduce the length of the clock wire. The important lesson here is that one
should never trust the CAD tools blindly and use a marginal design just because the CAD tools
predict it will work. There are just too many second order effects that you and the CAD tools may
have neglected.

Careful design is still necessary! Despite all rumors, there is still no good substitute for a
good electrical engineer who knows what he is doing and works very carefully.

3.3.2.2.  Well  Problems 

The last two problems, discussed in Sections 3.3.1 and 3.3.2.1, were both caused by multiple
mistakes. In this section, I want to show how one single mistake can cause a serious problem. One
of the supposedly good features of the Magic Layout System [SMH85] we used is that the
CMOS layout artists do not have to worry about well placement: Magic will generate the wells
automatically. Unfortunately, this simplifying assumption also means the Magic Layout System
does not extract the wells from the layout when it generates the switch level and circuit level
descriptions. Thus any layout verification tool based on the Magic Layout System cannot check
the wells either. One of the first things most circuit designers will warn you about with this
approach is that you may end up with floating wells. However, one of the most painful lessons we
learned in SPUR is that floating wells are not the only possible problem with this approach.

One problem in the SPUR CPU is the misplaced well and substrate contacts shown in Figure
3-3-5. In the N-well process used by SPUR, a misplaced well contact will cause a node to be stuck
at one (Figure 3-3-5(a)) and a misplaced substrate contact will cause a node to be stuck at zero
(Figure 3-3-5(b)). In the SPUR CPU, we are fortunate that all misplaced well contacts are at
redundant precharge transistors in the lower datapath. As illustrated in Figure 3-3-6, this enabled
us to solve the stuck-at-1 problem by cutting off the power supply to these precharge devices. The
stuck-at-0 problem, however, cannot be solved by laser cutting, and we were forced to do most
primary testing using a crippled 8-bit CPU until the problem was fixed in the second version of
the SPUR CPU.

Figure 3-3-6 A Quick Hardware Fix for Misplaced Well Contacts

The misplaced well contacts are all at the redundant precharge PMOS transistors in the lower
datapath. These precharge transistors are redundant because busS is also precharged by transistors
in the upper datapath. Furthermore, the Vdd lines that supply power to these redundant transistors
are connected to major power busses on either side. Therefore, by using the laser cutting system
at the Information Sciences Institute [Par87] to cut off the supply on both sides, we can isolate
these precharge devices without affecting the CPU function.

Floating wells are not a problem in the SPUR CPU, but they are a problem in the Cache
Controller chip. The Cache Controller floating well problem is discussed here because it is quite
different from what most people expect from a floating well. The first problem that comes to most
people’s minds concerning floating wells is latch-up. In the Cache Controller case, however, we
learned that a floating well can also be a problem if there is any dynamic storage node inside the
floating well. This is illustrated in Figure 3-3-7. This is ironic because a common technique to
build a dynamic storage node is to use a pair of pass transistors, and the well for one of these
transistors is likely to be a floating well because it is usually hard to place a well contact in this
congested area. As shown in Figure 3-3-7, if the PMOS pass transistor is in a floating well, the
dynamic register will retain its value correctly only if the well is above 4.3V.

Figure 3-3-7 Floating Well Problem

(a) shows a dynamic storage device in which data is stored dynamically in the node Dsme. (b)
shows how a PNP transistor is formed if the PMOS transistor whose gate is connected to clock is
in a floating well. (c) shows the equivalent circuit. When clock is asserted (0V), the dynamic
register latches in the data correctly. Unfortunately, when clock is deasserted (5V), this
dynamic register holds the value correctly only if the Well node is above 4.3V. Since the Well
node is charged to 4.3V whenever is 5V and will stay there until leakage current discharges
it, this dynamic register will operate correctly most of the time.

One thing we learned after we discovered all the well problems is that all these problems can
be detected by Magic if we do some tricks to the Magic technology file. In order to detect these
errors in Magic, the designer must request Magic to display the wells explicitly. Therefore I
concluded that, as far as the Magic Layout System is concerned, a well-independent design style
can be dangerous, although it may seem attractive in theory.

3.3.3.  Implementation  Problems 

The only implementation problem we had is that the backup copies of all special registers
(Figure 3-3-1(a)) were implemented incorrectly as dynamic registers instead of static or
pseudo-static registers. Since the current copy loads from its backup whenever a trap occurs, the
backup copy must retain the correct value at all times. This fact was so obvious to me, the
microarchitect, that I did not even bother to specify it explicitly in the documentation. The circuit
designer, on the other hand, did not have the same understanding of the operation and thought a
dynamic register was sufficient.

This problem was not detected during switch level simulation because the switch level
simulator does not simulate leakage current in the dynamic node. Furthermore, since all our test
programs have short run times (relative to the real workload), the leakage current was not a
problem during testing either. This problem was not discovered until we started debugging the
operating system. It was fixed in software by interrupting the CPU regularly to refresh (read and
write back) the special registers. The lesson here is that the microarchitect should specify
everything explicitly because what is obvious to him may not be obvious to the logic and circuit
designers, who are looking at the design at a much lower and more local level.

3.4.  The  SPUR  CPU  Technical  Lessons 

The SPUR CPU design process and all the problems taught us some valuable lessons. The
technical lessons are summarized in this section. The philosophical lessons are summarized in
Section 3.5.

3.4.1. Simulation and Testing Lessons

The  SPUR  CPU  simulation  process  consists  of  two  levels:  behavioral  level  and  switch 
level.  They  have  already  been  discussed  in  Section  3.1.3  and  Section  3.2.3,  respectively.  The 
SPUR  CPU  chip  testing  process  shown  in  Figure  3-4-1  also  consists  of  two  levels:  chip  test  and 
board  test.  The  switch  level  simulation  vectors  were  used  for  initial  chip  testing.  For  chip  debug¬ 
ging,  we  found  out  we  must  be  able  to  write  a  new  test,  run  it,  and  verify  the  results  rapidly.  This 
is  accomplished  by  automating  the  five-step  chip  testing  process. 

Figure 3-4-1 SPUR CPU Testing Strategy

The chip test is a five-step process: (1) Generate test vectors on the SUN workstation by running
behavioral diagnostics. (2) Download the vectors onto the DAS. (3) DAS drives the test board
and collects output vectors. (4) DAS sends output vectors back to the SUN. (5) Verify output
vectors on the SUN. After the CPU and Cache Controller chips have been tested independently,
they are tested together on the SPUR processor board. Uniprocessor diagnostics are loaded onto
the memory board and the DAS is used for debugging.

The behavioral level diagnostics for the SPUR processor board (see Processor Behavioral
Model, Figure 3-1-3) were used for the initial board test. In order to debug the processor board, we
must also write new tests. Although the board test for SPUR is much more extensive than those
for RISC II and SOAR, it is still not the ultimate test. One important lesson we learned is that the
ultimate test came when we tried to bring up the operating system [OCD88]. This is the time
when we discovered most of the problems. It is interesting to note that the operating system, in its
first 5ms of operation, requires the SPUR CPU to execute more instructions than the total number
of instructions the switch level simulator has simulated (Table 3-1-3). This is unavoidable
because real machines must run much faster than the simulator. It is not a major problem if the
diagnostics are well chosen. There were, however, a couple of important lessons we learned
concerning simulation and testing.

3.4.1.1. Lesson 1: One Size Does Not Fit All

The switch level simulator is twenty times slower than the behavioral simulator (60
sec/cycle versus 3 sec/cycle). Therefore only a subset of the behavioral diagnostics is used in
switch level simulation. Initially, we envisioned that the following process could be fully
automated:

(1)  Run  the  diagnostics  in  the  behavioral  simulator  and  trace  the  input/output  ports  of  the 
CPU. 

(2) Convert the traces into switch level test vectors for switch level simulation.

(3)  Convert  switch  level  test  vectors  to  logic  analyzer  vectors  for  chip  testing. 

We encountered two problems in automating this process. First, the behavioral simulator,
switch level simulator, and the logic analyzer all have different input/output formats. More
importantly, each initializes the chip differently and propagates "don’t care" conditions
differently. Consequently, we must edit some automatically generated test vectors and examine
whether reported errors are real errors. Both tasks are time consuming and error prone. The
designers of different simulators and the logic analyzer must work together to avoid this problem.
Second, behavioral diagnostics and switch level diagnostics have different requirements.
Behavioral diagnostics are verification diagnostics and you want them to be long and general.
Switch level and chip testing diagnostics, on the other hand, are debugging diagnostics; you want
them to be short and specific. We solved this problem by building long verification diagnostics
from short self-testing debugging diagnostics.
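
One way to picture this composition: each short debugging diagnostic checks its own result, and a long verification diagnostic is simply a sequence of them that stops at the first failure. The sketch below is a hypothetical illustration; the diagnostic names and the small dictionary standing in for CPU state are invented here and are not SPUR code.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, int]
Diagnostic = Callable[[State], bool]  # returns True when its self-check passes

def diag_add(state: State) -> bool:
    """Short, specific diagnostic: exercise an add and check the result."""
    state["r1"] = 2 + 3
    return state["r1"] == 5

def diag_shift(state: State) -> bool:
    """Short, specific diagnostic: exercise a shift and check the result."""
    state["r2"] = 1 << 4
    return state["r2"] == 16

def make_verification_diag(parts: List[Diagnostic]) -> Callable[[State], Optional[str]]:
    """Chain short self-testing diagnostics into one long diagnostic."""
    def run(state: State) -> Optional[str]:
        for part in parts:
            if not part(state):
                return part.__name__  # name of the first failing short diagnostic
        return None                   # every self-check passed
    return run

long_diag = make_verification_diag([diag_add, diag_shift])
failure = long_diag({})
print("all passed" if failure is None else f"failed in {failure}")
```

Because every part is self-testing, a failure in the long diagnostic points directly at the short diagnostic to rerun at the switch level or on the tester.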

3.4.1.2. Lesson 2: The Danger of Simulation

In our simulation world, diagnostics are executed one at a time: start a diagnostic, finish it,
then start the next diagnostic. This is not realistic because in the real world, programs seldom run
from start to finish without being interrupted. As a matter of fact, the special registers problem
discussed in Section 3.3.1 is one case where the SPUR CPU cannot recover from an interrupt. In
order to make behavioral simulation more realistic, we must ensure that each diagnostic can run
successfully even if its execution is interrupted randomly. An interesting approach is shown in
Figure 3-4-2, where the execution of one diagnostic is interrupted constantly by another
diagnostic.

Figure 3-4-2 Random Simulation Algorithm

This example limits the number of active diagnostics to two: A and B. The software manager
begins the simulation by randomly starting a diagnostic, say diagnostic N. After a random period of
time, the manager interrupts diagnostic N’s execution by starting another randomly selected
diagnostic, say diagnostic M. The manager then switches between the two diagnostics until one of
the diagnostics is completed. The manager then randomly selects another diagnostic: diagnostic P.
Each diagnostic must be self-checking and the manager should terminate the simulation as soon
as any error is detected.

This approach has several advantages. One major reason for multiplicative complexity is
the random interaction of different architectural features. This random interaction is caused by
random events such as traps and interrupts and can create a large number of CPU states. It is very
time consuming (it may not even be possible) for the designer to visualize all the possible states
and write diagnostics to cover them. However, by letting diagnostics interrupt each other
randomly, we can explore a large number of CPU states by using only a relatively small set of
diagnostics. Furthermore, due to random interaction, each time a new diagnostic is added to the set,
the increase in CPU states that can be tested goes beyond the checks in the new diagnostic.
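
The manager described in Figure 3-4-2 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SPUR simulation software: diagnostics are modeled as Python generators whose `yield` points stand in for places where an interrupt may occur, and the names `counting_diag` and `random_manager` are hypothetical.

```python
import random
from typing import Callable, Dict, Iterator, List

def counting_diag(name: str, steps: int) -> Iterator[None]:
    """Hypothetical self-checking diagnostic; each yield is an interruptible point."""
    total = 0
    for i in range(steps):
        total += i
        yield  # the manager may switch to another diagnostic here
    # Self-check: private state must survive any number of interruptions.
    if total != steps * (steps - 1) // 2:
        raise AssertionError(f"{name}: self-check failed")

def random_manager(factories: Dict[str, Callable[[], Iterator[None]]],
                   runs: int, seed: int = 0) -> List[str]:
    """Interleave at most two active diagnostics; return names in completion order."""
    if len(factories) < 2:
        raise ValueError("need at least two diagnostics to interleave")
    rng = random.Random(seed)
    active: Dict[str, Iterator[None]] = {}
    completed: List[str] = []
    started = 0
    while started < runs or active:
        # Start a randomly selected idle diagnostic if fewer than two are active.
        if len(active) < 2 and started < runs:
            idle = [n for n in factories if n not in active]
            name = rng.choice(idle)
            active[name] = factories[name]()
            started += 1
        # Run one active diagnostic for a random period, then switch.
        name = rng.choice(list(active))
        for _ in range(rng.randint(1, 5)):
            try:
                next(active[name])
            except StopIteration:
                completed.append(name)  # finished and passed its self-check
                del active[name]
                break
    return completed

factories = {
    "A": lambda: counting_diag("A", 20),
    "B": lambda: counting_diag("B", 35),
    "C": lambda: counting_diag("C", 10),
}
order = random_manager(factories, runs=6, seed=1)
print(order)
```

A failed self-check raises out of `random_manager` immediately, which mirrors the requirement that the manager terminate the simulation as soon as any error is detected. Fixing the seed makes a failing interleaving reproducible.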

3.4.2.  The  Nature  of  Microarchitectural  Design 

The simulation and testing lessons we learned are only part of the story. The root of the
problem is a gap in computer engineering education. The term microarchitecture, as defined
in Chapter 1, is the specification of how the macroarchitecture is implemented in a given
technology. Microarchitectural design has been treated more like an art than a science. This is
unfortunate because you can teach science but you cannot teach art! Consequently, the art of
microarchitectural design is not well taught and, as shown in Figure 3-4-3, there is a gap in
computer engineering education. This gap was not as apparent in the past in the academic world
because most universities’ VLSI projects are either driven from the top (implementing
architectural innovations) or driven from the bottom (trying out fast circuit technology). The only
way to close this gap is to make the microarchitectural design process more a science than an art
by developing a more systematic approach to microarchitectural design.

Figure 3-4-3 The Gap in Computer Engineering Education

(Labels in the figure: Microarchitectural Design; Mead & Conway Style VLSI Design.)

At the highest level, Computer Science classes are available for computer architecture. At the
lowest level, Electrical Engineering classes are available for digital circuit design. There is a big
gap between these two levels. In my opinion, a Mead & Conway style VLSI design class only
partially bridges the gap between these two levels because it is only a digital circuit design class
from a Computer Science perspective.

In my opinion, the key to making the microarchitectural design process into a science is to
put more emphasis on the tradeoffs between performance, resources, and complexity. As will be
discussed in Section 4.1, one way to measure performance is the T x I x C product, where T is
cycle time, I is the number of instructions it takes to execute certain benchmark programs, and C
is the average number of cycles per instruction. Chip area and transistor count are two examples
of resource metrics. The complexity of a design can be considered qualitatively as a measure of
how hard it is to specify and implement that design. The number of cycles of diagnostics and the
simulation effort are examples of quantitative complexity metrics.

Figure 3-4-4 Performance as a Function of Resources and Complexity

(a) is a three-dimensional plot of performance as a function of resources and complexity. (b) and
(c) are the two-dimensional projections of this design surface onto the resources and complexity
axes respectively. (b) shows that for a fixed amount of complexity, increasing the amount of
resources will increase the performance. Similarly, as shown in (c), for a fixed amount of
resources, increasing the amount of complexity will increase the performance. In either case, the
rule of diminishing returns applies. The RISC argument goes one step further than the rule of
diminishing returns: RISC proponents suggest that as the complexity gets too high, the performance
actually goes down.

Performance, resources, and complexity can be considered as three independent dimensions
in a multidimensional design space. With other variables in this multidimensional design space,
such as technology and the designer’s ability, held constant, alternative microarchitectures are
restricted to points on the three-dimensional design surface shown in Figure 3-4-4(a). Without a
high-level design automation system, a designer must go through the process of pruning this
design space by making tradeoffs. The most systematic way to make these tradeoffs is to
perform experiments that give quantitative estimates in the performance, resources, and complexity
dimensions. A designer would like to get these estimates with minimal effort and as early as
possible in the design process so that more alternative microarchitectures can be evaluated.

Figure 3-4-4(a) is a simplified view of the design space because, in reality, resources and
complexity are not completely independent. However, they are not as dependent as most people
think either. If resources are measured in terms of area and complexity is defined as the degree of
difficulty in understanding the operation of a module, then resources and complexity can be quite
independent. For example, a module may be small but its operation can still be very difficult to
understand. In Chapter 4, I will evaluate the different SPUR CPU features in terms of the
performance, resources, and complexity tradeoffs. Before I move on to the next chapter, I would
like to list the philosophical lessons I learned.

3.5.  The  SPUR  CPU  Philosophical  Lessons 

In this section, we will summarize some of the philosophical lessons we learned in designing
the SPUR CPU. Our experience has shown that these apparently trivial lessons may easily be
forgotten.

Keep it Simple. The simplest solution that works is also the most elegant solution because:
(1) unless you are willing and able to use the highest performance solutions for all components,
the overall performance gain from the improvement of a single component is limited, (2) simple
solutions require less design and implementation time and thus can make use of newer technology
that may negate many performance advantages of the complex solution, and (3) the simplest
solution requires the least human designer time, which in a sense is the most limited and expensive
resource. Consequently, as long as the simplest solution meets the performance goal and is
within the range of available resources, the designer should accept the solution and move on to
other problems waiting for him to solve. For example, the SPUR CPU uses a simple 4-phase
clocking scheme that places a lower limit on the CPU cycle time (approximately 100ns). This is
acceptable because the external bus and the memory system cannot run any faster than 100ns.

A Working Whole is Better than a Working Part. A designer should spend his time
solving unsolved problems instead of trying to find a better solution for an already solved
problem. Professor John Ousterhout at Berkeley once said: "The biggest performance enhancement
is achieved when going from a non-working system to a working system." This may sound trivial,
but whenever a designer is not making any progress in solving a new problem it can be very
tempting for him to go back to something he already understands and try to optimize it. For
example, we did not attempt to reduce the size of the CPU’s master control PLA any further
because it already fit nicely into its assigned space.

The A in CAD Means Aided. The CAD tools are there to help the designer, not to replace
him. The result can be catastrophic if the designer does not think or work carefully and expects
the CAD tools to do all his work and catch all his foolish mistakes. For example, switch level
simulators or even electrical rules checkers cannot detect many electrical problems such as
coupling, charge sharing, and race conditions. These can only be avoided by careful design.
Furthermore, a VLSI designer should realize that building his own simple tools is the best way to
define his problems for CAD tool designers. No matter how simple the tool he builds is, it is
probably still the best way to define the problem. Once the problem is better defined, it can be
explained to CAD tool designers, who can then develop tools that are more general, have more
features, and are more efficient for the problem.

The  Rubik’s  Cube  Analogy.  One  of  the  most  interesting  features  of  the  Rubik’s  Cube 
puzzle  is  that  each  step  in  solving  the  puzzle  usually  has  the  horrendous  effect  of  destroying  some 
results  of  previous  steps.  Similarly  the  designer  must  be  willing  to  throw  away  some  of  his  work 
that  does  not  perform  in  order  to  finish  the  design  project.  More  importantly,  if  the  designer  is 
unwilling  to  throw  away  any  of  his  work,  he  probably  will  be  unwilling  to  start  until  he  has  all 
the answers. Unfortunately, in most if not all cases, one will never get all the answers unless one
starts.  For  example,  in  the  beginning  of  the  SPUR  project,  we  did  some  layout  to  estimate  the 
relative  sizes  of  various  modules.  None  of  this  layout  was  used  in  the  final  CPU. 

Keep it Regular. The designer must always try to follow the same regular pattern. Our
experience in SPUR is that whenever we make an exception to save area or power, or just out of
laziness, we usually regret it later. For example, in SPUR, everything at the behavioral level is
modeled in N.2 [EEE85] except the FPU, which is modeled in SLANG [Van82]. Although the
reasons for using SLANG have long been forgotten, none of us has forgotten the grief it caused
when we tried to simulate the FPU with the rest of the system.

Chapter 4

MICROARCHITECTURAL EVALUATION

Few things are harder to put up with than the annoyance
of a good example.

Mark Twain


This chapter evaluates various features of the SPUR CPU in terms of their impact on
performance, resources, and complexity. Resources and complexity are quantified by sets of metrics.
Since each feature has a different impact on resources and complexity, the metrics used to quantify
the impact may be different. Therefore, I will talk about the metrics for each feature separately
when I discuss each feature. On the other hand, in order to study the performance impact
quantitatively, I must develop a performance model. This is done in Section 4.1. This performance
model can then be used to evaluate the performance improvement due to each feature by
comparing the performance of the SPUR CPU against the performance of an imaginary stripped down
SPUR CPU that does not have that feature. The SPUR CPU features to be evaluated in this
chapter are: LISP support in Section 4.2, FPU support in Section 4.3, longer pipeline in Section
4.4, on-chip instruction cache in Section 4.5, and multiprocessing support in Section 4.6. Insights
based on this evaluation are summarized in Section 4.7.

4.1.  The  Performance  Model 

Performance of a microarchitecture can be measured independent of implementation
considerations by measuring the number of instructions it takes to execute some benchmarks ($I$):

$$\text{Performance} = \frac{1}{I}$$

However, just as cache performance cannot be measured in terms of hit rate alone [Hil87a]
[Hil88], ignoring implementation considerations can be misleading. In order to include
implementation considerations, microarchitectural performance can be measured in terms of the
$T \times I \times C$ product [Hen85]:

$$\text{Performance} = \frac{1}{T \times I \times C} \qquad (4.1.1)$$

where

$T$ = Cycle time
$I$ = Number of instructions it takes to execute a benchmark
$C$ = Average number of cycles per instruction

Clearly $T \times I \times C \neq I$. Therefore, performance is not simply a function of instruction count.
During  the  design  process,  the  microarchitect  must  make  decisions  constantly  concerning 
whether  to  include  certain  features  in  the  microarchitecture.  In  order  to  make  these  decisions 
quantitatively,  the  microarchitect  wants  to  be  able  to  use  the  performance  model  to  predict  the 
performance  improvement  due  to  each  feature  under  consideration.  This  can  be  accomplished  by 
comparing  the  performance  of  a  base  microarchitecture  without  the  feature  under  consideration 
against  the  performance  of  the  enhanced  microarchitecture  with  the  feature. 


In the following discussion, I will refer to the feature under consideration as $\mathit{feature}_i$. I will
also use subscript "o" for the base microarchitecture without $\mathit{feature}_i$ ($T_o$, $I_o$, and $C_o$) and
subscript "i" for the enhanced microarchitecture with $\mathit{feature}_i$ ($T_i$, $I_i$, and $C_i$). Using this
notation, the performance gain and percentage performance improvement due to $\mathit{feature}_i$ can be
defined as:

$$\mathrm{GAIN}_i = \text{Performance gain due to } \mathit{feature}_i = \frac{T_o \times I_o \times C_o}{T_i \times I_i \times C_i} \qquad (4.1.2)$$

$$\mathrm{IMP}_i = \text{Performance improvement (\%) due to } \mathit{feature}_i = \left[ \mathrm{GAIN}_i - 1 \right] \times 100\% \qquad (4.1.3)$$

The $T$ and $C$ terms in the above equations take implementation considerations into account.
Therefore, the microarchitect must consider not just how the new feature ($\mathit{feature}_i$) will affect $I$
(change from $I_o$ to $I_i$) but also how it will affect $T$ (change from $T_o$ to $T_i$) and $C$ (change from
$C_o$ to $C_i$). The microarchitect must estimate the effect on $T$ and $C$ based on his experience, or by
the systematic approach discussed in Chapter 5. The effect on $I$ can be obtained from the
macroarchitect, who runs benchmarks on the instruction level simulators for the enhanced (measure
$I_i$) and base (measure $I_o$) architectures. Since architectural features are added to perform a specific
function more efficiently, another way to obtain the effect on $I$ is to estimate it indirectly by using
Equation 4.1.4, derived below. Before I can explain Equation 4.1.4, I must define the following
terms:


lo  =  Number  of  instructions  it  takes  to  execute  a  benchmark  without  feature  i, 

7,-  =  Number  of  instructions  it  takes  lo  execute  a  benchmark  with  feature  i, 

Ui  =  Number  of  times  feature  i  is  used  in  the  benchmark, 

Fi  =  Frequency  of  feature  i  in  the  benchmark  = 

Mo  =  Number  of  instructions  needed  to  perform  the  desired  function  without /earure,- ,  and 

Mi  =  Number  of  instructions  needed  to  perform  the  desired  function  with /eorwe,-. 

Most architectural features are added to reduce the number of instructions needed to perform a
certain function. In such cases, $M_o > M_i$. The number of instructions it takes to execute a benchmark
without $feature_i$ ($I_o$) can now be written in terms of $I_i$, $M_i$, $M_o$, and $F_i$:

$$I_o = I_i - F_i \times I_i \times M_i + F_i \times I_i \times M_o = I_i \times [1 + F_i \times (M_o - M_i)] \qquad (4.1.4)$$

Substituting Equation 4.1.4 into 4.1.2 and 4.1.3, we have:

$$GAIN_i = \frac{T_o}{T_i} \times \frac{C_o}{C_i} \times [1 + F_i \times (M_o - M_i)] \qquad (4.1.5)$$

$$IMP_i = \left\{ \frac{T_o}{T_i} \times \frac{C_o}{C_i} \times [1 + F_i \times (M_o - M_i)] - 1 \right\} \times 100\% \qquad (4.1.6)$$
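Equations 4.1.5 and 4.1.6 are easy to evaluate numerically. The following is a minimal Python sketch, not part of the original analysis; the function names are illustrative, and `p` is shorthand for the product of the two ratios $(T_o/T_i) \times (C_o/C_i)$. The sample inputs (a feature that replaces three instructions with one and is used 30% of the time, with and without a 10% penalty on both cycle time and cycles per instruction) are chosen only to exercise the formulas.

```python
def gain(p, f_i, m_o, m_i):
    """Equation 4.1.5, with p = (T_o / T_i) * (C_o / C_i)."""
    return p * (1.0 + f_i * (m_o - m_i))


def improvement_percent(p, f_i, m_o, m_i):
    """Equation 4.1.6: percentage performance improvement."""
    return (gain(p, f_i, m_o, m_i) - 1.0) * 100.0


# Feature replaces three instructions with one and is used 30% of the time.
no_penalty = improvement_percent(1.0, 0.30, 3, 1)

# Same feature, but T_i = 1.1 * T_o and C_i = 1.1 * C_o.
with_penalty = improvement_percent(1.0 / (1.1 * 1.1), 0.30, 3, 1)

print(round(no_penalty), round(with_penalty))
```

Writing the model this way makes it clear that the implementation penalty multiplies the whole bracketed term, so it lowers both the intercept and the slope of the improvement line.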

Figure 4-1-1(a) is a plot of Equation 4.1.6 where $M_o = 3$ and $M_i = 1$. This is the case where
adding the new feature enables one instruction to perform the same function that used to be
performed by three instructions. The p factor in Figure 4-1-1 is defined as the product of the cycle
time ratio ($T_o / T_i$) and the average number of cycles per instruction ratio ($C_o / C_i$). Notice that as
the p factor gets smaller, the advantage of the new feature is reduced in two ways: (1) the line is
shifted down, and (2) the slope of the line decreases. For example, if the new feature has no
effect on the cycle time ($T_i = T_o$) nor on the average number of cycles per instruction ($C_i = C_o$),
then p = 1 and there is a 60% performance gain if the new feature is used 30% of the time.
However, if this feature increases the cycle time and the average number of cycles per instruction by
10% each ($T_i = 1.1 \times T_o$ and $C_i = 1.1 \times C_o$), then p = 0.82 and the performance gain is only 32%.

Figure 4-1-1 Performance Graphs

Each line in (a) shows the performance improvement as a function of $F_i$ for $M_o = 3$, $M_i = 1$, and a
constant p factor. (b) shows the case where the p factor is a decreasing function of $F_i$ and p ≈ 1
when $F_i$ is small. In this case the performance improvement will follow the solid curve that is
close to the p = 1 line when $F_i$ is small. As $F_i$ increases, the curve levels off and follows the lines
with smaller p factors.

Figure 4-1-1(b) shows the performance improvement as a function of $F_i$ when the p factor
is a decreasing function of $F_i$. This is the case if the feature you added is a new instruction that
takes longer than the original average number of cycles ($C_o$) to execute. This is more complicated
because the new average number of cycles per instruction ($C_i$) goes up when the new feature (the
new instruction) is used more frequently (as $F_i$ increases). When the frequency of this instruction
is small ($F_i \approx 0$), the new average will still be close to the original average ($C_i \approx C_o$). However, as
the frequency of this instruction increases, the new average becomes bigger than the original
average ($C_i > C_o$) and the p factor ($p = \frac{T_o}{T_i} \times \frac{C_o}{C_i}$) decreases. In this case, even if the cycle time is
not affected by the new feature ($T_i = T_o$), you can enjoy the performance improvement of p = 1
only when $F_i$ is small.
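The shape of the curve in Figure 4-1-1(b) can be illustrated with a small Python sketch. The text does not give a formula for how $C_i$ grows with $F_i$, so the mixing model below is an assumption made only for illustration: the new instruction takes `e_new` cycles, all other instructions keep the original average $C_o$, and $C_i = (1 - F_i) \times C_o + F_i \times e_{new}$. The numeric values ($C_o = 1.5$, `e_new = 3`) are hypothetical.

```python
def p_factor(f_i, c_o, e_new, t_ratio=1.0):
    """p = (T_o / T_i) * (C_o / C_i) under the assumed mixing model for C_i."""
    c_i = (1.0 - f_i) * c_o + f_i * e_new  # assumed, not from the source
    return t_ratio * c_o / c_i


def improvement_percent(f_i, c_o, e_new, m_o, m_i=1):
    """Equation 4.1.6 with a p factor that depends on F_i."""
    p = p_factor(f_i, c_o, e_new)
    return (p * (1.0 + f_i * (m_o - m_i)) - 1.0) * 100.0


# Hypothetical case: C_o = 1.5 cycles; the new instruction takes 3 cycles
# and replaces a three-instruction sequence (M_o = 3, M_i = 1).
for f_i in (0.0, 0.1, 0.3, 0.5):
    print(f_i,
          round(p_factor(f_i, 1.5, 3.0), 3),
          round(improvement_percent(f_i, 1.5, 3.0, 3), 1))
```

Under this assumed model the p factor starts at 1 and falls as the instruction is used more often, which is the qualitative behavior the solid curve in the figure describes.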


4.2.  LISP  Support  Evaluation 

The SPUR CPU supports LISP by storing the type and generation information in the 8-bit
tag field of the register (Figure 2-1-2) and performing tag checking in parallel with the execution
of the following instructions (see Appendix A for a detailed discussion of the instruction set):

Add, Sub, And, Or, Xor, Sll, Sra, and Srl:

The CPU checks both operands to ensure that they are 32-bit integers (Fixnum).

Store_40:

The CPU checks the generations of the operands to ensure the generation boundary is not
crossed.

Cxr and Cxr_ro:

The CPU checks the pointer of the operand to make sure it is either a Cons or Nil.

Cmp_branch and Cmp_trap:

If the branch condition requires comparing the lower 32 bits of the two operands, the CPU
checks to make sure either both operands are Fixnum or both operands are Character.

The impact of hardware tag checking on LISP program performance, resource allocation,
and complexity is evaluated in Section 4.2.1, Section 4.2.2, and Section 4.2.3, respectively. The
results are summarized in Section 4.2.4.




4.2.1.  LISP  Support-Impact  on  Performance 

In this section, I will use the performance model developed in Section 4.1 to evaluate the
impact of hardware tag checking on performance by comparing the performance of the SPUR
CPU against an imaginary stripped down SPUR CPU that does not support tag checking. I will
use the subscript "i" for the SPUR CPU ($T_i$, $C_i$, and $M_i$) and the subscript "o" for the stripped down
CPU ($T_o$, $C_o$, and $M_o$). In order to give the stripped down CPU the benefit of the doubt, I assume
the stripped down CPU stores the type and generation information at some easy-to-access location
such that the work it takes to store and retrieve this information is the same as reading and writing
the tag in the SPUR CPU. Furthermore, I assume that if a type or generation violation occurs,
both the SPUR CPU and the stripped down CPU will handle the unusual cases similarly. Consequently,
the only difference is that the SPUR CPU checks the type and generation information implicitly
in parallel with the execution of certain instructions, while programs written for the stripped down
CPU must have explicit instructions to do the type and generation checking. When violations
occur, the SPUR CPU will trap to the unusual-case handlers while the stripped down CPU will
branch to the unusual-case handlers. Since the SPUR CPU checks the tag implicitly whenever
the special instructions are executed, the SPUR CPU takes one instruction to perform tag
checking:

$$M_i = 1 \qquad (4.2.1)$$

Figure 4-2-1 Performance Improvement due to Tag Checking

(a) shows the performance gain due to hardware tag checking as a function of $F_i$. The best
($M_o = 7$, $F_i = 0.34$), median ($M_o = 5$, $F_i = 0.23$), and worst ($M_o = 3$, $F_i = 0.12$) arguments
for having implicit tag checking are marked in this diagram. (a) assumes $T_i = T_o$ and $C_i = C_o$.

(b) shows the effect if hardware tag checking increases the cycle time ($T_i > T_o$), or the average
number of cycles to execute an instruction ($C_i > C_o$), or both.

According to the SPUR LISP group [Zor89], LISP programs for the stripped down CPU take
between three and seven instructions to do the type or generation checking explicitly:

$$3 \le M_o \le 7 \;\Rightarrow\; \text{Median } M_o = 5 \qquad (4.2.2)$$

Since the SPUR CPU checks the tag in parallel with the instruction's execution, the average
number of cycles to execute each instruction is NOT affected by adding the feature:

$$C_i = C_o \;\Rightarrow\; \frac{C_o}{C_i} = 1 \qquad (4.2.3)$$

The performance improvement, assuming hardware tag checking has no effect on the cycle
time ($T_o / T_i = 1$), can be estimated using the performance model (Equation 4.1.6) and the values
given by Equations 4.2.1, 4.2.2, and 4.2.3. Figure 4-2-1(a) is a plot of the performance
improvement as a function of $F_i$ for $M_o = 3$, $M_o = 5$, and $M_o = 7$. According to George Taylor [Tay86], the
percentage of instructions that require type and generation checking is between 12% and 34%:

$$12\% \le F_i \le 34\% \;\Rightarrow\; \text{Median } F_i = 23\% \qquad (4.2.4)$$

Based on Equation 4.2.2 and Equation 4.2.4, we have

Median argument for having implicit tag checking: $M_o = 5$, $F_i = 0.23$
Best argument for having implicit tag checking: $M_o = 7$, $F_i = 0.34$
Worst argument for having implicit tag checking: $M_o = 3$, $F_i = 0.12$
The performance improvement, assuming hardware tag checking has no effect on the cycle
time ($T_o / T_i = 1$) nor on the average number of cycles per instruction ($C_o / C_i = 1$), is predicted in
Figure 4-2-1(a) to be 204%, 92%, and 24% for the best, median, and worst arguments, respectively.
If having hardware tag checking increases the cycle time ($T_i > T_o$), or the average number of cycles
to execute an instruction ($C_i > C_o$), or both, then Figure 4-2-1(b) predicts the performance
improvement to be less than the improvement predicted in Figure 4-2-1(a). For example, the 24%
performance gain predicted by the worst argument will be completely nullified if adding hardware
tag checking causes an increase in the cycle time such that $T_o / T_i = 0.8$. As will be explained at the
end of Section 4.2.4, the point p = 0.4 can be considered as the place where MIPS-X resides based
on Steenkiste's LISP analysis [StH88].
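The three predictions read from Figure 4-2-1(a) can be reproduced directly from Equation 4.1.6 with $M_i = 1$. The short Python sketch below is illustrative only; the dictionary and function names are not from the original text, and the $(M_o, F_i)$ pairs are the ones given by Equations 4.2.2 and 4.2.4.

```python
M_I = 1  # Equation 4.2.1: one instruction with implicit tag checking

# (M_o, F_i) pairs for the three arguments.
CASES = {"best": (7, 0.34), "median": (5, 0.23), "worst": (3, 0.12)}


def improvement_percent(p, f_i, m_o, m_i=M_I):
    """Equation 4.1.6, with p = (T_o / T_i) * (C_o / C_i)."""
    return (p * (1.0 + f_i * (m_o - m_i)) - 1.0) * 100.0


# p = 1: tag checking affects neither cycle time nor cycles per instruction.
at_p1 = {name: round(improvement_percent(1.0, f, m))
         for name, (m, f) in CASES.items()}
print(at_p1)

# Worst argument with a cycle-time penalty such that T_o / T_i = 0.8.
print(round(improvement_percent(0.8, 0.12, 3), 1))
```

The second print shows the worst-case gain falling to roughly zero (slightly negative) at p = 0.8, which is the nullification described above.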

The horizontal axis of Figure 4-2-1(b) is not labeled simply as $T_o / T_i$, although I have argued
$C_o / C_i = 1$ in Equation 4.2.3. It is still labeled as $p = \frac{T_o}{T_i} \times \frac{C_o}{C_i}$ to emphasize the fact that although
hardware tag checking does not affect the average number of cycles per instruction directly
(Equation 4.2.3), it may still change the average indirectly. For example, if we do not implement
hardware tag checking and somehow can transfer the effort to improving the on-chip instruction
cache, this better instruction cache may reduce the average number of cycles per instruction from
1.8 to 1.5. We can then still achieve p ≈ 0.8 even if the cycle time is not changed ($T_o / T_i = 1$):

$$p = \frac{T_o}{T_i} \times \frac{C_o}{C_i} = \frac{T_o}{T_i} \times \frac{1.5}{1.8} = 0.83$$

I have defined the critical p factor ($p_{critical}$) as the value of p at which the performance
improvement of the SPUR CPU over the stripped down CPU is zero. As shown in Figure 4-2-1(b), the
critical p factor for the best case is $p_{critical} = 0.33$, for the median case is $p_{critical} = 0.53$, and for the
worst case is $p_{critical} = 0.81$. Let us give the stripped down CPU the benefit of the doubt and assume the
effort we saved in not implementing hardware tag checking can be used to improve the cycle time
and average number of cycles per instruction each by 5%:

$$p = \frac{T_o}{T_i} \times \frac{C_o}{C_i} = 0.95 \times 0.95 = 0.90$$

The performance improvement due to LISP support is reduced to 173% for the best case, 73% for
the median case, and 12% for the worst case.
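The critical p factor has a simple closed form: setting Equation 4.1.6 to zero gives $p_{critical} = 1 / [1 + F_i \times (M_o - M_i)]$. The Python sketch below (illustrative names, not from the original) evaluates it for the three arguments and then re-evaluates the improvement at $p = 0.95 \times 0.95$. The closed form gives about 0.52 for the median case, close to the 0.53 read from the figure, and the improvements at $p = 0.95 \times 0.95$ land within about a percentage point of the quoted values.

```python
CASES = {"best": (7, 0.34), "median": (5, 0.23), "worst": (3, 0.12)}


def critical_p(f_i, m_o, m_i=1):
    """Value of p at which Equation 4.1.6 gives zero improvement."""
    return 1.0 / (1.0 + f_i * (m_o - m_i))


def improvement_percent(p, f_i, m_o, m_i=1):
    """Equation 4.1.6."""
    return (p * (1.0 + f_i * (m_o - m_i)) - 1.0) * 100.0


p_saved = 0.95 * 0.95  # 5% better cycle time and 5% better cycles per instruction

for name, (m_o, f_i) in CASES.items():
    print(name,
          round(critical_p(f_i, m_o), 2),
          round(improvement_percent(p_saved, f_i, m_o), 1))
```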

4.2.2.  LISP  Support-Impact  on  Resources

The hardware tag checking that supports LISP requires six special instructions, eight extra
tag bits in the lower datapath, six extra branch conditions, and four extra trap conditions. The six
special instructions are (see Table A-3-1 and Table A-3-2 in Appendix A for a more detailed
discussion of these instructions and Cache Operations):

(1) LD_40 - Load the 32-bit data and 8-bit tag field of the 40-bit register simultaneously.

(2) LD_40_RO - Similar to LD_40 except it sends a special Cache Operation to the Cache
Controller for multiprocessing.

(3) CXR - Special LD_40 instruction that performs LISP pointer type checking in parallel.

(4) CXR_RO - Similar to CXR except it sends a special Cache Operation to the Cache
Controller for multiprocessing.

(5) RD_TAG - Read the 8-bit tag field of the register.

(6) WR_TAG - Write the 8-bit tag field of the register.

Resource Metric            Six Extra       Eight           Extra Branch    Extra Trap      Total
                           Instructions    Tag Bits        Conditions      Conditions
Control PLA Outputs        4/54     7%     0/54     0%     0/54     0%     0/54     0%     7%
Control PLA Products       2.2/84   3%     0/84     0%     0/84     0%     0/84     0%     3%
Active Area (mm^2)         0/57     0%     2.7/57   4.7%   0.1/57   0.2%   1.1/57   1.9%   7%
Transistors (thousands)    0/115    0%     8.7/115  7.6%   0.1/115  0.1%   1.1/115  0.9%   9%
Number of Signal Pins      0/156    0%     8/156    5%     0/156    0%     0/156    0%     5%

Table 4-2-1 Resources Metrics for Hardware Tag Checking

Each column lists the absolute and percentage impact of each sub-feature on the different
resources metrics. The percentage impact is calculated by dividing the absolute impact by the
total value of that metric in the SPUR CPU. The total impact of these sub-features, that is, the
impact due to hardware tag checking, is the sum of the four columns and is shown in the rightmost
column.

The six extra branch conditions (Table A-3-9, Appendix A) check the 6-bit type tag (Figure 2-1-2)
of the operands:

(1) EQ_TAG - Checks whether the 6-bit type tags of the two operands are equal.

(2) NE_TAG - Checks whether the 6-bit type tags of the two operands are not equal.

(3) EQ_38 - Checks whether the 32-bit data and the 6-bit type tags of the two operands are
equal.

(4) NE_38 - Checks whether the 32-bit data and the 6-bit type tags of the two operands are
not equal.

(5) EQ_TC - Checks whether the 6-bit type tag of the first operand equals a six-bit constant.

(6) NE_TC - Checks whether the 6-bit type tag of the first operand does not equal a six-bit constant.

The four extra trap conditions are (Table A-3-8, Appendix A):

(1) LISP Pointer Type Violation - The pointer is neither a CONS nor a NIL.

(2) LISP Data Type 1 Violation - Either operand is not a FIXNUM.

(3) LISP Data Type 2 Violation - The operands are neither both FIXNUM nor both
CHARACTER.

(4) Generation Violation - The generation of the second operand is greater than that of the first
operand.

The six extra instructions, the eight extra tag bits, the six extra branch conditions, and the
four extra trap conditions have different impacts on resource allocation. Their different impacts
must be measured by different resources metrics. The leftmost column of Table 4-2-1 shows the
resources metrics I selected. Since each metric's absolute value has a different dimension,
comparing different metrics' absolute values is like comparing apples and oranges. Therefore, I find it
more useful to look at the dimensionless percentage impact on each metric. The percentage
impact is calculated by dividing the absolute impact by the total value of that metric in the
SPUR CPU chip. For example, the eight extra tag bits increase the area by 2.7 mm$^2$ (absolute
impact). Since the total active chip area in the SPUR CPU is 57 mm$^2$, the percentage impact is
2.7/57 = 4.7%.

The first row of Table 4-2-1 shows that the SPUR CPU master control PLA has a total of 54
outputs. Four of these 54 outputs (7%) are used to control the six extra instructions. Since the
other sub-features do not affect the number of outputs in the master control PLA, this 7% is the
total impact due to hardware tag checking. Similarly, the rightmost column of the second, third,
fourth, and fifth rows shows that tag checking for LISP support is responsible for 3% of the master
control PLA product terms, 7% of the total active area, 9% of the total transistors, and
5% of the total signal pins. Notice that the six extra instructions' impact on resources (Column 1)
is mainly on the Control PLAs. Their impact on chip area and transistor count is minimal. On
the other hand, the eight extra tag bits (Column 2) have minimal impact on the Control PLAs but
affect the chip area, transistor count, and the number of signal pins. Finally, the extra branch
and trap conditions' impact (Columns 3 and 4) is mainly on the chip area and transistor count.
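The percentage-impact calculation behind Table 4-2-1 can be sketched in a few lines of Python. The numbers are the absolute impacts and totals from the table; the labels "Active area" and "Transistors" for the rows with totals of 57 and 115 are inferred from the surrounding text (57 mm$^2$ of active area, 115 thousand transistors) and should be read as an assumption. The names `TABLE` and `percentage_impact` are illustrative.

```python
# Absolute impacts in the column order of Table 4-2-1:
# (six extra instructions, eight tag bits, branch conditions, trap conditions),
# followed by the total value of that metric in the SPUR CPU.
TABLE = {
    "Control PLA outputs": ((4, 0, 0, 0), 54),
    "Control PLA products": ((2.2, 0, 0, 0), 84),
    "Active area (mm^2)": ((0, 2.7, 0.1, 1.1), 57),        # row label inferred
    "Transistors (thousands)": ((0, 8.7, 0.1, 1.1), 115),  # row label inferred
    "Signal pins": ((0, 8, 0, 0), 156),
}


def percentage_impact(absolute, total):
    """Dimensionless impact: absolute impact divided by the metric's total."""
    return 100.0 * absolute / total


totals = []
for metric, (impacts, total) in TABLE.items():
    per_feature = [round(percentage_impact(x, total), 1) for x in impacts]
    overall = round(percentage_impact(sum(impacts), total))
    totals.append(overall)
    print(metric, per_feature, overall)
```

Summing the absolute impacts before dividing, as done here, avoids accumulating rounding error from the per-column percentages.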

4.2.3.  LISP  Support-Impact  on  Complexity 

The complexity due to the LISP supporting features can be quantified by the effort to verify
their correctness by simulation. The LISP supporting features of the SPUR CPU were simulated at
both the behavioral and switch levels. The diagnostics we used for switch level simulation are:

(1) cmp-tag_insts - This diagnostic takes 470 cycles to verify the cmp_branch instructions
that use branch conditions involving tag comparison.

(2) fixnum-trap - This diagnostic takes 1086 cycles to verify that all LISP data type
violations will cause a trap.

(3) fast-reg-tags - This diagnostic takes 1169 cycles to verify that the CPU can read and
write the tag field of all the registers.

(4) gen-traps - This diagnostic takes 509 cycles to verify that all generation violations will
cause a trap.

(5) cxr-traps - This diagnostic takes 1159 cycles to verify that all LISP pointer type
violations will cause a trap.

(6) trap-psw - This diagnostic takes 272 cycles to verify that all traps set the processor
status word (Kpsw and Upsw) correctly.

                        Absolute Impact    Percentage Impact
Number of Cycles        4,665              4,665/55,516 = 8%
Man-Months of Effort    0.5                0.5/3.5 = 14%

Table 4-2-2 Complexity Metrics for Hardware Tag Checking

The first column lists the absolute impact on the complexity metrics due to hardware tag
checking. The second column lists the percentage impact. The percentage impact is calculated by
dividing the absolute impact by the total value of that metric in the SPUR CPU.

The total number of cycles and the man-months of simulation effort are two complexity
metrics we can extract from this set of diagnostics. Column 1 of Table 4-2-2 is the absolute value
of these two metrics. As I explained in the previous section, I found it more useful to look at the
dimensionless percentage impact. The percentage impact of each metric is calculated in Column
2. The total number of cycles of diagnostics we ran for the SPUR CPU's switch level simulation
is 55,516 cycles. The total switch level simulation effort is 3.5 man-months. These two numbers
are used in Column 2 to calculate the percentage of the total effort.

                      Pessimistic View    Realistic View    Optimistic View
Performance Impact    Worst: 11%          Median: 73%       Best: 174%
Resources Impact      Biggest: 9%         Median: 6%        Smallest: 3%
Complexity Impact     Biggest: 14%        Median: 11%       Smallest: 8%

Table 4-2-3 Three Different Views of the Tradeoffs

The pessimistic view uses the smallest performance impact and the biggest resources and
complexity impact. The realistic view uses the median performance impact, median resources impact,
and median complexity impact. The optimistic view uses the biggest performance impact and the
smallest resources and complexity impact.
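The complexity percentages in Table 4-2-2 follow from the diagnostic cycle counts listed in Section 4.2.3. The Python sketch below is only a cross-check of that arithmetic; the variable names are illustrative, and the spelling of the first diagnostic's name is a best reading of the source.

```python
# Switch level diagnostics for the LISP supporting features and their cycle counts.
DIAGNOSTIC_CYCLES = {
    "cmp-tag_insts": 470,   # name as best recovered from the source
    "fixnum-trap": 1086,
    "fast-reg-tags": 1169,
    "gen-traps": 509,
    "cxr-traps": 1159,
    "trap-psw": 272,
}

TOTAL_CYCLES = 55516         # all switch level diagnostics for the SPUR CPU
TAG_EFFORT_MONTHS = 0.5      # simulation effort spent on tag checking
TOTAL_EFFORT_MONTHS = 3.5    # total switch level simulation effort

tag_cycles = sum(DIAGNOSTIC_CYCLES.values())
cycle_pct = round(100.0 * tag_cycles / TOTAL_CYCLES)
effort_pct = round(100.0 * TAG_EFFORT_MONTHS / TOTAL_EFFORT_MONTHS)

print(tag_cycles, cycle_pct, effort_pct)
```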
4.2.4.  LISP  Support-Impact  Summary

The impact of hardware tag checking on LISP program performance, resource allocation,
and complexity is illustrated in Figure 4-2-1, Table 4-2-1, and Table 4-2-2, respectively. The
results of this figure and these tables are summarized in Table 4-2-3, giving three different views of
the performance, resources, and complexity tradeoffs. These three views of the tradeoffs are
shown graphically in Figure 4-2-2. The median and optimistic views both seem to indicate that
hardware tag checking is a good feature because the percentage improvement in performance is
big while the percentage increases in resources and complexity are relatively small. Furthermore,
as illustrated in Figure 4-2-1(b), both the best and median performance improvement arguments
have relatively small critical p factors ($p_{critical} = 0.53$ for the median case and $p_{critical} = 0.33$ for the
best case). In other words, even if the cycle time or the average number of cycles per instruction
or both are affected moderately by having hardware tag checking, the SPUR CPU with tag
checking will still come out the winner.
[Figure 4-2-2: three panels, (a) Pessimistic, (b) Median, and (c) Optimistic, each with Performance, Resources, and Complexity axes; panel (b) is marked +73% performance and panel (c) +174% performance.]

Figure 4-2-2 Three Different Views of the Tradeoffs

(a) shows the pessimistic view, which uses the smallest performance impact and the biggest
resources and complexity impact. (b) shows the median view, which uses the median performance
impact, median resources impact, and median complexity impact. (c) shows the optimistic
view, which uses the biggest performance impact and the smallest resources and complexity
impact. For clarity, a logarithmic scale is used for the performance axis.


Finally, the performance analysis shown in Figure 4-2-1 can also be used to compare the
LISP performance of the SPUR CPU and the MIPS-X. Both the SPUR CPU and the
MIPS-X are RISC style load-store processors that aim for single cycle execution. The MIPS-X,
however, does not have hardware tag checking. The MIPS-X has a cycle time of 50ns and its average
number of cycles per instruction is also lower than the SPUR CPU's, mainly due to its higher
internal instruction cache hit rate. Therefore, to a first order approximation, the MIPS-X can be
considered as a stripped down SPUR CPU that does not have hardware tag checking but has a faster
cycle time and a lower number of cycles per instruction. More specifically, assuming the MIPS-X
cycle time is 50ns and the average number of cycles per instruction is 1.2 (80% of the
SPUR CPU's), we have:

$$p = \frac{T_o}{T_i} \times \frac{C_o}{C_i} = \frac{50}{100} \times \frac{1.2}{1.5} = 0.40$$

The performance improvement due to LISP support is now reduced to 22% for the best case,
-23% for the median case, and -50% for the worst case. In other words, for the best case (the SPUR
CPU's point of view), the MIPS-X performance is only 82% (1/1.22) of the SPUR CPU's
performance. However, for the median and worst cases, the MIPS-X is 30% (1/0.77) and 100% (1/0.5)
faster. Incidentally, these numbers agree with Steenkiste's analysis [StH88].
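The MIPS-X comparison is the same model evaluated at a single p factor. The Python sketch below is illustrative; the SPUR value of 1.5 cycles per instruction is inferred from the statement that 1.2 is 80% of the SPUR CPU's figure and is therefore an assumption, as are the variable names.

```python
CASES = {"best": (7, 0.34), "median": (5, 0.23), "worst": (3, 0.12)}

T_SPUR_NS, T_MIPSX_NS = 100.0, 50.0  # cycle times
C_SPUR, C_MIPSX = 1.5, 1.2           # cycles per instruction (1.5 is inferred)

# The MIPS-X plays the role of the stripped down CPU (subscript o).
p = (T_MIPSX_NS / T_SPUR_NS) * (C_MIPSX / C_SPUR)


def improvement_percent(p, f_i, m_o, m_i=1):
    """Equation 4.1.6: improvement of the SPUR CPU over the stripped down CPU."""
    return (p * (1.0 + f_i * (m_o - m_i)) - 1.0) * 100.0


spur_vs_mipsx = {name: round(improvement_percent(p, f, m))
                 for name, (m, f) in CASES.items()}
print(round(p, 2), spur_vs_mipsx)
```

A negative entry means the MIPS-X is faster for that argument; dividing 1 by the corresponding gain gives the MIPS-X speed relative to the SPUR CPU.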

These numbers, however, should not be used to draw the conclusion that hardware tag
checking is a bad idea. That would be unfair because hardware tag checking should not be blamed for
the SPUR CPU's relatively slow cycle time (100ns) and small internal instruction cache (128
instructions). The SPUR CPU cycle time is limited by system considerations, conservative circuit
design, and the conservative 4-phase non-overlap clocking scheme. The main reason why the
SPUR CPU has a small internal instruction cache is that we tried to be conservative and built the
instruction cache using the relatively large static RAM cells that were used in the register file. A
better conclusion is that the MIPS-X designers, who were willing to take more risks, used more
aggressive circuit designs to lower the cycle time and increase the size of the internal instruction
cache. Under most circumstances, these improvements are enough to offset the SPUR CPU's
hardware tag checking.

4.3.  FPU  Support  Evaluation

The SPUR CPU supports floating point arithmetic through a coprocessor, the Floating Point Unit
(FPU). The FPU is connected to the SPUR CPU via a parallel coprocessor interface [HaK86].
Detailed discussions of the coprocessor interface and the FPU design can be found in [Han88] and
[Bos88], respectively. The floating point instructions supported by the FPU are:

(1) F_ADD - Floating point add.

(2) F_SUB - Floating point subtract.

(3) F_MUL - Floating point multiply.

(4) F_DIV - Floating point divide.

(5) F_CMP - Floating point compare.

(6) F_MOV - Floating point move.

(7) F_ABS - Find the absolute value of a floating point number.

(8) F_NEG - Negate a floating point number.

(9) CVTS - Convert a double precision number to single precision.

(10) CVTD - Convert a single precision number to double precision.

(11) SYN - Synchronize the CPU and the FPU.

The CPU treats all these instructions as NOOP when the FPU is disabled and treats them as
illegal instructions when the FPU is enabled. F_ADD, F_SUB, F_CMP, F_MUL, and F_DIV can be
considered the major FPU instructions because it is the FPU's goal to provide the ADD, SUB,
MUL, DIVIDE, and CMP operations. The other FPU instructions can be considered supporting
instructions because they are provided to make the FPU operate more efficiently.
Section 4.3.1 will focus on the performance impact of the major FPU instructions. Section 4.3.2
and Section 4.3.3 discuss FPU support's impact on resources and complexity from the SPUR CPU's
perspective. Section 4.3.4 summarizes the results.

4.3.1.  FPU  Support-Impact  on  Performance 

In this section, I will use the performance model developed in Section 4.1 to evaluate the
impact of FPU support on the performance of floating point intensive programs by comparing the
SPUR CPU to an imaginary stripped down SPUR CPU that does not support floating point
operations. As before, I use the subscript "i" for the SPUR CPU ($T_i$, $C_i$, and $M_i$) and the subscript "o" for
the stripped down CPU ($T_o$, $C_o$, and $M_o$). As far as the SPUR CPU is concerned, each floating
point operation is supported by a floating point instruction to be executed by the FPU; therefore:
$$M_i = 1 \qquad (4.3.1)$$

Operation    Precision    Execution      Number of          Approx. Number of
                          Time (us)      Cycles (N_RISC)    Instructions (M_RISC)
Add/Sub      S            63             472.5              80
             D            83             622.5              100
Multiply     S            110            825                140
             D            680            5100               850
Divide       S            191            1432.5             240
             D            712            5340               900

Table 4-3-1 RISC I Floating Point Operations Measurements

Column 2 lists the execution times of the various floating point operations. Using the numbers in
Column 2 and Equation 4.3.2, we can calculate the numbers in Column 3. Then using the
numbers in Column 3 and Equation 4.3.3, we can calculate the numbers in Column 4.

On the other hand, in the stripped down CPU, the floating point operations must be
performed by floating point routines. The execution times of various floating point routines on a
RISC I simulator running at (7.5/4) MHz were measured by Sippel [Sip82]. I have divided the
7.5 MHz quoted in Sippel's report by four because RISC I has a four-phase clock and 7.5 MHz is
the frequency between phases. Since we know the average number of cycles per instruction for
RISC I is approximately 1.5 ($C_{RISC} = 1.5$), we can calculate the approximate number of
instructions RISC I takes to emulate the various floating point operations ($M_{RISC}$) by using Equations
4.3.2 and 4.3.3. The results are summarized in Table 4-3-1:

$$N_{RISC} = ET_{RISC} \times \frac{1}{T_{RISC}} = ET_{RISC} \times \frac{7.5}{4}\ \text{MHz} \qquad (4.3.2)$$

$$M_{RISC} = \frac{N_{RISC}}{C_{RISC}} = \frac{N_{RISC}}{1.5} \qquad (4.3.3)$$

where

$N_{RISC}$ = Number of cycles to execute the floating point operation in RISC I
$ET_{RISC}$ = Execution time of the floating point operation in RISC I
$T_{RISC}$ = Cycle time of RISC I simulator = $\frac{4}{7.5\ \text{MHz}}$
$M_{RISC}$ = Number of instructions to execute the floating point operation in RISC I
$C_{RISC}$ = Average number of cycles per instruction in RISC I = 1.5

Assuming the number of instructions to execute the various floating point operations in the stripped
down CPU is similar to that in RISC I and the number of instructions to execute a floating point
compare is similar to that of a floating point subtract, we have:

$80 \le M_o$ (Add/Sub/Cmp) $\le 100$
$140 \le M_o$ (Multiply) $\le 850$
$240 \le M_o$ (Divide) $\le 900$
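Equations 4.3.2 and 4.3.3 can be applied to the measured execution times with a short Python sketch. This is an illustration under the equations as written, with names chosen here for convenience. One observation worth hedging: the Column 3 values printed in Table 4-3-1 equal the execution time multiplied by 7.5, so they appear to be counted at the 7.5 MHz phase rate, whereas the sketch uses the (7.5/4) MHz cycle rate. With that rate, the instruction counts agree with Column 4 to within a few percent.

```python
CYCLE_RATE_MHZ = 7.5 / 4  # RISC I simulator cycle rate (four-phase clock)
C_RISC = 1.5              # average cycles per instruction on RISC I

# Measured execution times in microseconds, keyed by (operation, precision).
EXEC_TIME_US = {
    ("Add/Sub", "S"): 63, ("Add/Sub", "D"): 83,
    ("Multiply", "S"): 110, ("Multiply", "D"): 680,
    ("Divide", "S"): 191, ("Divide", "D"): 712,
}

# Approximate instruction counts quoted in Column 4 of Table 4-3-1.
TABLE_M_RISC = {
    ("Add/Sub", "S"): 80, ("Add/Sub", "D"): 100,
    ("Multiply", "S"): 140, ("Multiply", "D"): 850,
    ("Divide", "S"): 240, ("Divide", "D"): 900,
}


def cycles(et_us):
    """Equation 4.3.2: microseconds times MHz gives cycles."""
    return et_us * CYCLE_RATE_MHZ


def instructions(et_us):
    """Equation 4.3.3: cycles divided by cycles per instruction."""
    return cycles(et_us) / C_RISC


for key, et_us in EXEC_TIME_US.items():
    print(key, round(cycles(et_us), 1), round(instructions(et_us), 1), TABLE_M_RISC[key])
```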

The coprocessor interface that connects the SPUR CPU and the FPU allows them to operate
together in two different modes:

Sequential Mode

After issuing an FPU instruction, the CPU must wait until the FPU finishes before continuing
its operation. Since it takes the FPU longer to execute any FPU instruction than the CPU
takes to execute an integer instruction, the average number of cycles per instruction will
increase when an FPU instruction is executed ($C_i > C_o$).

Parallel Mode

After issuing an FPU instruction, the CPU continues to execute integer instructions and will
stall only if it encounters another FPU instruction while the FPU is still busy with a previous
FPU instruction. In the best scenario, there are enough integer instructions in between FPU
instructions and the average number of cycles per instruction will NOT increase when an FPU
instruction is executed ($C_i = C_o$).

In order to derive the equation that relates $C_i$ and $C_o$ for the sequential and parallel modes, I need
to define the following terms:

$E_j$ = Number of cycles to execute instruction type $j$

$F_j$ = Frequency of instruction type $j$
Assume we have N types of CPU (integer) instructions and P types of floating point instructions.
Then the average number of cycles per instruction for the SPUR CPU operating in sequential
mode with the FPU is $C_i$:

$$C_i = \sum_{j=1}^{N+P} F_j \times E_j \qquad (4.3.4)$$

For the stripped down CPU that does not support floating point instructions, the average number
of cycles per instruction is $C_o$:

$$C_o = \sum_{j=1}^{N} F'_j \times E_j \qquad (4.3.5)$$

The term $F'_j$ (j = 1, 2, 3, ..., N) is used in Equation 4.3.5 to emphasize the fact that the
frequency of the integer instructions may change when floating point instructions are eliminated.
Let us assume the changes in frequencies are small; then:


Operation   Fj                Ej    Fj × Ej           Mj            Fj × Mj
            Min      Max            Min      Max      Min    Max    Min     Max
FADD        2.4%     3.3%     4     0.096    0.132    80     100    1.92    3.30
FSUB        1.7%     2.4%     4     0.068    0.096    80     100    1.36    2.40
FCMP        1.4%     1.9%     4     0.056    0.076    80     100    1.12    1.90
FMUL        3.2%     4.5%     7     0.224    0.315    140    850    4.48    38.25
FDIV        1.3%     1.9%     19    0.247    0.361    240    900    3.12    17.10
Total       10.0%    14.0%    -     0.691    0.980    -      -      12.0    62.95

Table  4-3-2  Impact  of  FPU  Support  on  Performance 

Every column, except Column Ej, is divided into two sub-columns that correspond to the
minimum and maximum values of that column. The minimum sub-column of Column Fj × Ej is
calculated from (Min Fj) × Ej. The maximum sub-column of Column Fj × Ej is calculated from
(Max Fj) × Ej. The minimum sub-column of Column Fj × Mj is calculated from
(Min Fj) × (Min Mj). The maximum sub-column of Column Fj × Mj is calculated from
(Max Fj) × (Max Mj).

$$F'_j \approx F_j \quad (j = 1, 2, \ldots, N) \qquad (4.3.6)$$

Consequently, by combining Equations 4.3.4, 4.3.5, and 4.3.6, we have for sequential mode:

$$C_i = C_o + \sum_{j=N+1}^{N+P} F_j \times E_j \qquad (4.3.7)$$

Similarly, for parallel mode:

$$C_i = C_o + (1 - POR_{par}) \times \sum_{j=N+1}^{N+P} F_j \times E_j \qquad (4.3.8)$$

where

$POR_{par}$ = Portion of FPU operations in parallel with CPU operations
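
Equations 4.3.7 and 4.3.8 can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions, and `fpu_cycle_sum` stands for the sum of $F_j \times E_j$ over the floating point instruction types; the numeric inputs in the demo are placeholders rather than measured values.

```python
def effective_cpi(c_o, fpu_cycle_sum, por_par=0.0):
    """Return C_i following Equation 4.3.8.

    c_o           -- cycles per instruction of the stripped down CPU (C_o)
    fpu_cycle_sum -- sum of F_j * E_j over the FPU instruction types
    por_par       -- portion of FPU operations overlapped with CPU work;
                     0.0 reduces to sequential mode (Equation 4.3.7)
    """
    if not 0.0 <= por_par <= 1.0:
        raise ValueError("por_par must lie between 0 and 1")
    return c_o + (1.0 - por_par) * fpu_cycle_sum


if __name__ == "__main__":
    # Placeholder inputs: C_o = 1.5 and a cycle sum of 0.836.
    for por in (0.0, 0.4, 0.8, 1.0):
        print(f"POR_par = {por:.1f} -> C_i = {effective_cpi(1.5, 0.836, por):.3f}")
```

With `por_par = 1.0` the FPU term vanishes and the function returns `c_o` unchanged, which is the best scenario described for parallel mode.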
Figure 4-3-1 Performance Improvement due to FPU Support

(a) and (b) show the performance improvement as a function of the frequency of the coprocessor
(FPU) instructions ($F_i$) and the portion of FPU instructions that can be executed in parallel with
CPU instructions ($POR_{par}$). I have assumed the average number of cycles per CPU instruction to
be 1.5 for (a) and 2.0 for (b). (c) shows the effect on the best and worst case if supporting the
FPU degrades the CPU cycle time ($T_i > T_o$).

Table 4-3-2 summarizes the typical values for the frequency of various floating point
operations, $F_j$ [Pat89] [Tay89]; the number of cycles it takes the SPUR FPU to execute these
operations, $E_j$ [Bos88]; and the number of instructions the stripped down CPU takes to emulate
these operations, $M_j$ (Table 4-3-1). The products $F_j \times E_j$ and $F_j \times M_j$ are also
calculated in this table. The numbers in Table 4-3-2 can be used to calculate the numbers needed
by the performance model (Equation 4.1.6) using the following formulas:

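$$\sum \text{Min } F_j < F_i < \sum \text{Max } F_j$$

$$\Rightarrow 10\% < F_i < 14\% \Rightarrow \text{Median } F_i = 12\% \qquad (4.3.9)$$

$$\frac{\sum \text{Min}(F_j \times M_j)}{\sum \text{Min } F_j} < M_o < \frac{\sum \text{Max}(F_j \times M_j)}{\sum \text{Max } F_j}$$

$$\Rightarrow 120 < M_o < 450 \Rightarrow \text{Median } M_o = 285 \qquad (4.3.10)$$

$$\sum \text{Min}(F_j \times E_j) < \sum F_j \times E_j < \sum \text{Max}(F_j \times E_j)$$

$$\Rightarrow 0.691 < \sum F_j \times E_j < 0.980 \Rightarrow \text{Median} \sum F_j \times E_j = 0.836 \qquad (4.3.11)$$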

In the formulas above, the summation limits j = N+1 and j = N+P are dropped for clarity.
The values given by Equation 4.3.11 are used in Equation 4.3.8 to calculate the effective $C_i$ in
terms of $C_o$. These numbers can then be used with the performance model (Equation 4.1.6) to
calculate the performance improvement due to the FPU supporting features. This is shown in Figure
4-3-1 as a function of the frequency of FPU instructions ($F_i$), the portion of FPU instructions
that can be executed in parallel with CPU instructions ($POR_{par}$), and the cycle time ratio
($T_o/T_i$).
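
The bounds in Equations 4.3.9 through 4.3.11 follow directly from the rows of Table 4-3-2, as the Python sketch below illustrates. The dictionary layout and function name are assumptions made for this example; the $E_j$ entries are the values implied by the table's $F_j \times E_j$ columns.

```python
# Rows of Table 4-3-2: (F_min, F_max, E, M_min, M_max).
TABLE_4_3_2 = {
    "FADD": (0.024, 0.033, 4, 80, 100),
    "FSUB": (0.017, 0.024, 4, 80, 100),
    "FCMP": (0.014, 0.019, 4, 80, 100),
    "FMUL": (0.032, 0.045, 7, 140, 850),
    "FDIV": (0.013, 0.019, 19, 240, 900),
}


def model_inputs(table):
    """Return (lower, upper, midpoint) for F_i, M_o and sum(F_j * E_j)."""
    rows = list(table.values())
    f_lo = sum(r[0] for r in rows)
    f_hi = sum(r[1] for r in rows)
    fe_lo = sum(r[0] * r[2] for r in rows)
    fe_hi = sum(r[1] * r[2] for r in rows)
    # M_o is the frequency-weighted average number of emulation instructions.
    m_lo = sum(r[0] * r[3] for r in rows) / f_lo
    m_hi = sum(r[1] * r[4] for r in rows) / f_hi

    def with_mid(lo, hi):
        return lo, hi, (lo + hi) / 2.0

    return {
        "F_i": with_mid(f_lo, f_hi),
        "M_o": with_mid(m_lo, m_hi),
        "sum_FE": with_mid(fe_lo, fe_hi),
    }


if __name__ == "__main__":
    for name, (lo, hi, mid) in model_inputs(TABLE_4_3_2).items():
        print(f"{name}: {lo:.3f} .. {hi:.3f} (midpoint {mid:.3f})")
```

The unrounded upper bound on $M_o$ comes out near 449.6, which Equation 4.3.10 rounds to 450.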

Each line in Figure 4-3-1(a) and (b) assumes a fixed frequency of FPU instructions (10%,
12%, and 14%) and a fixed average number of cycles per CPU instruction: 1.5 in (a) and 2.0 in
(b). Notice that Figure 4-3-1(a) and Figure 4-3-1(b) predict the same amount of performance
improvement when $POR_{par} = 1.0$ because all FPU instructions are executed in parallel with the
CPU instructions. Consequently, $C_o / C_i = 1$ at $POR_{par} = 1$ for both graphs (a) and (b).
Assuming the portion of FPU instructions that can be executed in parallel with the CPU integer
instructions is between 40% and 80%, we have the following best and worst case scenarios:

Worst  Case: 

The frequency of FPU instructions is only 10% ($F_i = 0.10$), the average number of instructions
the stripped down CPU takes to emulate the FPU operations is 120 ($M_o = 120$), and the
average number of cycles per CPU instruction is 1.5 ($C_o = 1.5$).

Best  Case: 

The frequency of FPU instructions is 14% ($F_i = 0.14$), the average number of instructions the
stripped down CPU takes to emulate the FPU operations is 450 ($M_o = 450$), and the average
number of cycles per CPU instruction is 2.0 ($C_o = 2.0$).

Figure 4-3-1(a) predicts that even in the worst case, the FPU support improves the floating
point performance by 900%: the SPUR CPU is 10 times faster than the stripped down CPU. Figure
4-3-1(b) predicts that in the best case, the FPU support improves the floating point performance
by 5700%: the SPUR CPU is 58 times faster than the stripped down CPU. Both cases assume that
the FPU support has no effect on the CPU cycle time. However, as shown in Figure 4-3-1(c),
even if the cycle time is degraded by 10%, the performance improvement is still
respectable: 5100% for the best case and 700% for the worst case.
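
As a rough cross-check of the two scenarios, the following Python sketch estimates the speedup under several stated assumptions: execution time is taken to be proportional to instruction count times cycles per instruction times cycle time; every FPU instruction is assumed to be replaced by `m_o` integer instructions in the stripped down CPU; the median cycle sum 0.836 from Equation 4.3.11 is used; and the worst and best cases are paired with $POR_{par}$ = 0.4 and 0.8 respectively. Equation 4.1.6 is not restated in this section, so this is an approximation of the model rather than a transcription of it, and all names are illustrative.

```python
def fpu_speedup(f_i, m_o, c_o, por_par, fpu_cycle_sum=0.836, t_ratio=1.0):
    """Estimated speedup of the SPUR CPU with FPU over the stripped down CPU.

    f_i      -- frequency of FPU instructions in the SPUR CPU program
    m_o      -- integer instructions needed to emulate one FPU operation
    c_o      -- cycles per instruction of the stripped down CPU
    por_par  -- portion of FPU operations overlapped with CPU instructions
    t_ratio  -- cycle time ratio T_o / T_i (1.0 means no degradation)
    """
    c_i = c_o + (1.0 - por_par) * fpu_cycle_sum          # Equation 4.3.8
    # Cycles per SPUR instruction slot when FPU work is emulated in software.
    stripped_cycles = ((1.0 - f_i) + f_i * m_o) * c_o
    return (stripped_cycles / c_i) * t_ratio


def improvement_percent(speedup):
    """Convert a speedup factor into a percentage improvement."""
    return (speedup - 1.0) * 100.0


if __name__ == "__main__":
    worst = fpu_speedup(f_i=0.10, m_o=120, c_o=1.5, por_par=0.4)
    best = fpu_speedup(f_i=0.14, m_o=450, c_o=2.0, por_par=0.8)
    print(f"worst case: {worst:.1f}x ({improvement_percent(worst):.0f}%)")
    print(f"best case:  {best:.1f}x ({improvement_percent(best):.0f}%)")
```

Under these assumptions the sketch gives speedups of roughly 10 and 59, in line with the values read from Figure 4-3-1.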

4.3.2. FPU Support-Impact on CPU Resources

In addition to the 11 floating point instructions listed at the beginning of this section, the
FPU support capabilities also require eight load instructions, four store instructions, a
coprocessor interface, two extra branch conditions, and one trap condition. As far as the SPUR
CPU is concerned, these extra FPU load and store instructions are similar to the CPU load and
store instructions except that the FPU will receive or provide the data. The SPUR CPU is still
responsible for generating the effective address and sending the proper Cache Operation code to
the Cache Controller. The two extra branch conditions are:

(1)  FPU_TRUE  -  Branch  if  the  floating  point  compare  instruction  results  in  a  true  condition. 

(2) FPU_FALSE - Branch if the floating point compare instruction results in a false condition.


The extra trap condition is:

(1) FPU_EX - FPU exception.

Metric                   FPU Instructions    Coprocessor Interfaces    Total
Control PLA Outputs      3/54      6%        0/54       0%             6%
Control PLA Products     4/84      5%        0/84       0%             5%
Chip Area (mm x mm)      0/57      0%        6/57       10%            10%
Transistors (x 1000)     0/115     0%        0.4/115    0%             0%
Number of Signal Pins    32/156    21%       5/156      3%             24%

Table 4-3-3 Resources Metrics for FPU Support

Each column lists the absolute and percentage impact of each sub-feature on the different
resources metrics. The percentage impact is calculated by dividing the absolute impact by the
total value of that metric in the SPUR CPU. The total impact of these sub-features, that is, the
impact due to FPU support, is summarized in the rightmost column.

The impact of FPU support on resources is quantified by several resources metrics in
Table 4-3-3. The impact of the two extra branch conditions and the one extra trap condition on
resources is so small that it is not listed in the table. The number of transistors consumed by the
FPU support is also negligible. Area consumption is mainly due to the coprocessor interface,
which includes suspension logic in the master control, a special register FpuPC, and the routing
of the instruction bus onto the output pads. FPU support also consumes 37 of the total 156 signal
pads. The SPUR CPU must broadcast the instruction currently being fetched via 32 of these pads.
This is the only way the FPU can find out the current instruction because the internal instruction
cache makes the instruction currently being fetched invisible to the outside world.

4.3.3. FPU Support-Impact on CPU Complexity

Metric                   Absolute Impact    Percentage Impact
Cycles of Diagnostics    1,543              1,543/55,516 = 3%
Man-Month of Effort      0.5                0.5/3.5 = 14%

Table 4-3-4 Complexity Metrics for FPU Support

The first column lists the absolute impact on the complexity metrics due to FPU support. The
second column lists the percentage impact. The percentage impact is calculated by dividing the
absolute impact by the total value of that metric in the SPUR CPU.

The complexity due to the FPU supporting features can be quantified by the simulation
effort. The diagnostics we used for switch level simulation are:

(1) fpu-fpu_busy - This diagnostic takes 190 cycles to verify that the CPU will suspend
the pipeline when the FPU is busy and the CPU wants to issue a new FPU instruction.

(2) fpu-enable_fpop - This diagnostic takes 142 cycles to verify that the CPU will treat FPU
operation instructions as illegal instructions when the FPU is disabled and treat them as
FPU instructions when the FPU is enabled.

(3) fpu-enable_ld - This diagnostic takes 144 cycles to verify that the CPU will treat FPU
load and store instructions as illegal instructions when the FPU is disabled and treat them
as FPU instructions when the FPU is enabled.

(4)  fpu-fpc  -  This  diagnostic  takes  148  cycles  to  verify  that  the  address  of  the  last  FPU 
instruction  issued  by  the  CPU  is  stored  in  the  special  register  FpuPC. 

(5)  fpu-serial  -  This  diagnostic  takes  173  cycles  to  verify  that  the  CPU  and  FPU  can  operate 
in  sequential  mode. 

(6) fpu-sync - This diagnostic takes 142 cycles to verify that the CPU and FPU operations
can be synchronized by the SYNC instruction.

(7) fpu-cpu_trap - This diagnostic takes 153 cycles to verify that the FPU operation can
survive a CPU trap.

(8) fpu-fpu_trap - This diagnostic takes 144 cycles to verify that the FPU operation can
interrupt the CPU operation via an FPU exception.

(9)  fpu-cmp  -  This  diagnostic  takes  146  cycles  to  verify  that  the  CPU  can  correctly  execute 
a  cmp_branch  that  uses  the  FPU  branch  conditions. 

(10)  fpu-allopcodes  -  This  diagnostic  takes  161  cycles  to  verify  that  the  CPU  can  correctly 
identify  all  FPU  instructions. 

The total number of cycles and the man-months of simulation effort are two complexity
metrics we can extract from this set of diagnostics. These are summarized in Table 4-3-4.
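
The two complexity metrics can be recomputed from the diagnostic list with a short Python sketch. The cycle counts are those of the ten diagnostics in the order listed, and the totals (55,516 cycles and 3.5 man-months) are the denominators shown in Table 4-3-4; the variable and function names are illustrative assumptions.

```python
# Cycle counts of the ten FPU diagnostics, in the order they are listed.
FPU_DIAGNOSTIC_CYCLES = [190, 142, 144, 148, 173, 142, 153, 144, 146, 161]

TOTAL_DIAGNOSTIC_CYCLES = 55_516   # all SPUR CPU diagnostics (Table 4-3-4)
FPU_EFFORT_MAN_MONTHS = 0.5
TOTAL_EFFORT_MAN_MONTHS = 3.5


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


fpu_cycles = sum(FPU_DIAGNOSTIC_CYCLES)
cycle_impact = percentage(fpu_cycles, TOTAL_DIAGNOSTIC_CYCLES)
effort_impact = percentage(FPU_EFFORT_MAN_MONTHS, TOTAL_EFFORT_MAN_MONTHS)

if __name__ == "__main__":
    print(f"FPU diagnostic cycles: {fpu_cycles}")
    print(f"cycle impact:  {cycle_impact:.1f}%")
    print(f"effort impact: {effort_impact:.1f}%")
```

The sum of the ten cycle counts is 1,543, which matches the absolute impact in the table.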

4.3.4. FPU Support-Impact Summary

Figure 4-3-2 Three Different Views of the Tradeoffs

(a) shows the pessimistic view, which uses the smallest performance impact and the biggest
resources and complexity impact. (b) shows the median view, which uses the median performance
impact, median resources impact, and median complexity impact. (c) shows the optimistic view,
which uses the biggest performance impact and the smallest resources and complexity
impact. For clarity, the performance axis is on a logarithmic scale.

The FPU support's impact on performance (Figure 4-3-1), resources (Table 4-3-3), and
complexity (Table 4-3-4) can be summarized in the following inequalities:

700%  <  Performance  Impact  <  5100%  =>  Median  Performance  Impact  =  2900% 

5%  <  Resources  Impact  <  24%  =>  Median  Resource  Impact  =  14.5% 

3%  <  Complexity  Impact  <  14%  =>  Median  Complexity  Impact  =  8.5% 

In Table 4-3-3, the FPU support's impact on transistors is 0%. To keep the analysis
conservative, this is not used as the minimum resources impact. Instead, the next smallest
increase (5%) is used. These numbers indicate that using the coprocessor FPU via a coprocessor
interface to support floating point operations is a good idea: the FPU support increases
resources and complexity by only a small amount but improves the performance of floating point
intensive programs drastically. This is, of course, only the CPU's perspective. A lot of resources
and complexity not included in this tradeoff analysis are involved in the design and
implementation of the coprocessor FPU [Bos88] and the coprocessor interface. Furthermore, the
performance improvement is for floating point intensive programs only. If a program does not
have any floating point operations, the coprocessor interface and the FPU will not improve the
program's performance.
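
The "median" values quoted in the inequalities of this summary are the midpoints of the lower and upper bounds. A minimal Python sketch of that arithmetic follows; the dictionary and function names are assumptions made for the example.

```python
def midpoint(lower, upper):
    """Midpoint of a (lower, upper) bound pair, used as the median estimate."""
    return (lower + upper) / 2.0


# Bounds on the FPU support's impact, in percent, from the inequalities above.
FPU_IMPACT_BOUNDS = {
    "performance": (700, 5100),
    "resources": (5, 24),
    "complexity": (3, 14),
}

fpu_medians = {name: midpoint(lo, hi) for name, (lo, hi) in FPU_IMPACT_BOUNDS.items()}

if __name__ == "__main__":
    for name, value in fpu_medians.items():
        print(f"median {name} impact: {value}%")
```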

4.4.  Extra  Pipeline  Stage  Evaluation 

The only difference between the SPUR CPU pipeline and the RISC II pipeline is shown in
Figure 4-4-1. The SPUR CPU pipeline has an extra memory access stage (Mem) that allows the
SPUR CPU to execute LOAD without suspending the pipeline. The extra pipeline stage's impact
on performance, resources, and complexity is evaluated in Sections 4.4.1, 4.4.2, and 4.4.3
respectively. The results are summarized in Section 4.4.4.

4.4.1.  Extra  Pipe  Stage-Impact  on  Performance 

Figure 4-4-1 RISC II Pipeline vs. SPUR CPU Pipeline

This figure assumes the external cache will provide the data within one cycle. Under this
assumption, the SPUR CPU 4-stage pipeline will execute LOAD without pipeline suspension.
However, since the execution stage (Exec) of the instruction following the LOAD (I2) overlaps
the memory access stage (Mem) of the LOAD, I2 cannot use the LOAD's destination register.
We call LOAD a delay instruction and I2 the delay slot. On the other hand, the RISC II pipeline
is suspended for one cycle whenever LOAD is executed. Due to this one cycle suspension, the
execute stage (Exec) of I2 is delayed until after the data access phase of the LOAD, and I2 can
use the LOAD's destination register.

This section evaluates the extra pipeline stage's impact on performance by comparing the
performance of the SPUR CPU against an imaginary stripped down SPUR CPU that uses the
RISC II three stage pipeline. I will use subscript "i" ($T_i$, $C_i$, $M_i$) for the SPUR CPU and the
subscript "o" ($T_o$, $C_o$, $M_o$) for the stripped down CPU. In the SPUR CPU 4-stage pipeline, the
instruction after the LOAD (I2 in Figure 4-4-1) cannot use the destination register of the LOAD.
This delay slot must be filled by a NOOP unless we can find an instruction that does not use the
destination register of the LOAD. Therefore, in the worst case, the load function is performed by
two instructions, the LOAD and the NOOP in the delay slot:

$$M_i = 2 - POR_{fill} \qquad (4.4.1)$$

$POR_{fill}$ = Portion of the delay slots filled by a useful instruction

On the other hand, in the 3-stage pipeline, the instruction immediately after the LOAD can
use the destination register of the LOAD. Therefore, the number of instructions it takes to
perform the load function is just one, the LOAD instruction:

$$M_o = 1 \qquad (4.4.2)$$

The average number of cycles per instruction for the SPUR CPU ($C_i$) is different from the
average number of cycles per instruction for the stripped down CPU ($C_o$) because the 3-stage
pipeline must be suspended for data access whenever a LOAD is executed. In order to look at this
difference quantitatively, I define the following terms:

$N_{load\_4}$ = Average number of cycles to execute LOAD in the 4-stage pipeline,

$N_{load\_3}$ = Average number of cycles to execute LOAD in the 3-stage pipeline,

$N_{other}$ = Average number of cycles to execute other instructions in either pipeline,

$U_{load}$ = Number of LOAD instructions in the benchmark,

$I_i$ = Number of instructions it takes the SPUR CPU to execute the benchmark,

$I_o$ = Number of instructions it takes the stripped down CPU to execute the benchmark,

$F_i$ = Frequency of LOAD instructions in the SPUR CPU = $U_{load} / I_i$, and

$F'_{load}$ = Frequency of LOAD instructions in the stripped down CPU = $U_{load} / I_o$

I assume the number of LOAD instructions ($U_{load}$) and the average number of cycles to
execute instructions other than LOAD ($N_{other}$) to be the same for programs that are written for
either pipeline. Since the SPUR CPU can execute LOAD without pipeline suspension, and I assume
the external memory can provide data within a cycle (if not, it will affect either pipeline
equally), the average number of cycles to execute LOAD in the SPUR CPU 4-stage pipeline is the
same as for all other instructions:

$$N_{load\_4} = N_{other} \;\Rightarrow\; C_i = N_{other} \qquad (4.4.3)$$

Since the RISC II 3-stage pipeline always suspends the pipeline for one cycle whenever
LOAD is executed, the average number of cycles to execute LOAD is just one more cycle than
the average for other instructions:

$$N_{load\_3} = N_{other} + 1$$

$$C_o = F'_{load} \times N_{load\_3} + (1 - F'_{load}) \times N_{other} = N_{other} + F'_{load} \qquad (4.4.4)$$

Combining Equation 4.4.3 and Equation 4.4.4, we have:

$$C_o = C_i + F'_{load} \qquad (4.4.5)$$

In general, the frequency of LOAD in the stripped down CPU ($F'_{load}$) will be slightly bigger
than the frequency of LOAD in the SPUR CPU ($F_i$) because programs written for the SPUR CPU
will have some extra NOOPs in the delay slots of LOAD. These extra NOOPs increase the total
number of instructions to execute the benchmark ($I_i$), and since the number of LOADs ($U_{load}$) is
constant, the frequency is lower. The number of NOOPs in the LOAD's delay slots is $U_{noop}$:

$$U_{noop} = I_i \times F_i \times (1 - POR_{fill})$$

The number of instructions it takes the stripped down CPU to execute a benchmark can be
calculated by subtracting the extra NOOPs in the LOAD's delay slots from the number of
instructions it takes the SPUR CPU to execute the same benchmark:

$$I_o = I_i - U_{noop} = I_i \times [1 - F_i \times (1 - POR_{fill})]$$

The frequency of LOAD in the stripped down CPU can now be calculated as:

$$F'_{load} = \frac{U_{load}}{I_o} = \frac{I_i \times F_i}{I_i \times [1 - F_i \times (1 - POR_{fill})]} = \frac{F_i}{1 - F_i \times (1 - POR_{fill})} \qquad (4.4.6)$$

Before we go any further, let us perform a couple of sanity checks on Equation 4.4.6. Assume
$POR_{fill} = 1$. Equation 4.4.6 gives $F'_{load} = F_i$. This makes sense because this is the case
when all LOAD delay slots are filled with useful instructions in the SPUR CPU. As a second check,
assume $POR_{fill} = 0$. The maximum $F_i$ possible for the SPUR CPU 4-stage pipeline with
$POR_{fill} = 0$ is 0.5 because there must be a NOOP for every LOAD: half of the instructions are
NOOPs. Using $POR_{fill} = 0$ and $F_i = 0.5$, Equation 4.4.6 predicts $F'_{load} = 1$. This again
makes sense because in the stripped down CPU, LOADs do not have to be separated by NOOPs.
Therefore our two checks show Equation 4.4.6 to be "sane".
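
The two sanity checks can also be run mechanically. The Python sketch below implements Equation 4.4.6 as a function; the function name and the input validation are assumptions added for illustration.

```python
def stripped_load_frequency(f_i, por_fill):
    """LOAD frequency in the stripped down CPU, following Equation 4.4.6.

    f_i      -- LOAD frequency in the SPUR CPU program
    por_fill -- portion of LOAD delay slots filled by a useful instruction
    """
    if not 0.0 <= por_fill <= 1.0:
        raise ValueError("por_fill must lie between 0 and 1")
    noop_fraction = f_i * (1.0 - por_fill)      # U_noop / I_i
    if noop_fraction >= 1.0:
        raise ValueError("NOOPs cannot make up the whole program")
    return f_i / (1.0 - noop_fraction)


if __name__ == "__main__":
    # Check 1: all delay slots filled, so the frequency is unchanged.
    print(stripped_load_frequency(0.3, 1.0))
    # Check 2: no delay slot filled and F_i = 0.5, so every remaining
    # instruction in the stripped down CPU is a LOAD.
    print(stripped_load_frequency(0.5, 0.0))
```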

Combining Equation 4.4.5 and Equation 4.4.6, we have:

$$\frac{C_o}{C_i} = 1 + \frac{F_i}{C_i \times [1 - F_i \times (1 - POR_{fill})]} \qquad (4.4.7)$$

Conventional wisdom says that a longer pipeline usually has a shorter cycle time because
a longer pipeline usually means there is less to do in each pipe stage [Kog81]. This,
however, is not the case when we add one more stage to the RISC II 3-stage pipeline to form the
SPUR CPU 4-stage pipeline because:

(1) The increase in the number of stages is not a result of dividing the tasks into smaller
pieces. The original tasks (Ifet, Exec, and Wr) are the same for both pipelines.

(2) The extra Mem stage is a delay stage for all instructions other than LOAD. This extra
pipe stage increases the complexity of the datapath and control logic.

Since (1) states that the original tasks are not getting any simpler and (2) states that the extra
task increases the complexity, the cycle time of the SPUR 4-stage pipeline ($T_i$) is likely to be
bigger than the cycle time of the RISC II 3-stage pipeline ($T_o$):

$$T_i > T_o \;\Rightarrow\; \frac{T_o}{T_i} < 1 \qquad (4.4.8)$$

Figure 4-4-2 Performance Improvement due to the Extra Pipe Stage

(a) and (b) show the performance improvement as a function of the portion of the LOAD delay
slots being filled ($POR_{fill}$) by useful instructions and the frequency of the LOAD ($F_i$). I have
assumed the average number of cycles per non-LOAD instruction to be 1.5 for (a) and 2.0 for (b).
(a) and (b) assume the pipeline with the extra pipe stage has the same cycle time as the shorter
pipeline ($T_i = T_o$). (c) shows the effect if the longer pipeline has a longer cycle time
($T_i > T_o$).

The performance improvement of the SPUR CPU 4-stage pipeline over the stripped down
3-stage pipeline is plotted in Figure 4-4-2 as a function of the portion of the delay slots being
filled by useful instructions ($POR_{fill}$), the frequency of LOAD ($F_i$), and the cycle time ratio
($T_o/T_i$). This is the result of applying Equations 4.4.1, 4.4.2, 4.4.7, and 4.4.8 to the
performance model (Equation 4.1.6). Each line in Figure 4-4-2(a) and (b) assumes a fixed frequency
of LOAD (10%, 20%, and 30%) and a fixed average number of cycles per instruction for all
non-LOAD instructions: 1.5 in (a) and 2.0 in (b). There are two things worth noticing:

(1) The performance improvement is negative when $POR_{fill}$ is 0 because the 3-stage pipeline
only requires the pipeline to suspend for one cycle while the 4-stage pipeline will waste
$C_i$ cycles ($C_i > 1$) to execute the NOOP in the delay slot due to misses in the internal
instruction cache.

(2) The portion of the delay slots that must be filled (the critical $POR_{fill}$) in order for the
4-stage pipeline to have the same performance as the 3-stage pipeline ($IMP_i = 0\%$) is a
function of the average number of cycles per non-LOAD instruction ($C_i$) only. The
breakeven point depends on $POR_{fill}$ and not on the LOAD frequency ($F_i$) because the
number of cycles the SPUR CPU wastes whenever a LOAD is executed ($SPUR_{waste}$) is:

$$SPUR_{waste} = C_i \times [1 - POR_{fill}]$$

At the critical $POR_{fill}$, the number of cycles the SPUR CPU wastes equals the number of
cycles the 3-stage pipeline must be suspended whenever a LOAD is executed. In the RISC II
3-stage pipeline, this number of cycles is one. Therefore:

$$C_i \times [1 - (\text{Critical } POR_{fill})] = 1$$

$$\text{Critical } POR_{fill} = 1 - \frac{1}{C_i} \qquad (4.4.9)$$

Using Equation 4.4.9, we have:

Critical $POR_{fill}$ ($C_i = 1.5$) = 0.33    Critical $POR_{fill}$ ($C_i = 2.0$) = 0.50
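
Equation 4.4.9 is simple enough to evaluate directly; the Python sketch below does so for the two values of $C_i$ used in this section. The function name is an illustrative assumption.

```python
def critical_por_fill(c_i):
    """Delay-slot fill portion at which the 4-stage pipeline breaks even
    with the 3-stage pipeline (Equation 4.4.9)."""
    if c_i < 1.0:
        raise ValueError("cycles per instruction cannot be below 1")
    return 1.0 - 1.0 / c_i


if __name__ == "__main__":
    for c_i in (1.5, 2.0):
        print(f"C_i = {c_i}: critical POR_fill = {critical_por_fill(c_i):.2f}")
```

Note that at $C_i = 1$ the critical portion is 0: if a NOOP cost only one cycle, an unfilled delay slot would waste no more than the one-cycle suspension of the 3-stage pipeline.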
In the discussion above, we have assumed that the only reason non-LOAD instructions
take more than one cycle to execute is misses in the internal instruction cache. This is true
for the SPUR CPU, in which the average number of cycles per non-LOAD instruction ($C_i$) is
estimated to be somewhere between 1.5 and 2.0. Furthermore, let us give the 4-stage pipeline the
benefit of the doubt and assume:

(1)  The  pipeline  with  the  extra  pipe  stage  has  the  same  cycle  time  as  the  shorter  pipeline. 

(2) The frequency of LOAD is 30% ($F_i = 0.3$).

(3) The LOAD delay slot is filled by a useful instruction 80% of the time ($POR_{fill} = 0.8$).

The performance improvement, as shown in Figure 4-4-2(a) and (b), is still only between
9% and 14%. However, as discussed earlier (Inequality 4.4.8), the SPUR CPU cycle time is likely
to be bigger than that of the stripped down CPU. As shown in Figure 4-4-2(c), the cycle time of
the stripped down CPU only has to be approximately 10% smaller than the SPUR CPU cycle
time (between 1 - 0.92 = 8% and 1 - 0.88 = 12%, to be exact) to neutralize all the performance
advantage of the SPUR CPU 4-stage pipeline.
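
The 9% to 14% range and the break-even cycle time ratios can be reproduced with a short Python sketch. It assumes that execution time is proportional to instruction count times cycles per instruction times cycle time; Equation 4.1.6 is not restated in this section, so this form is an assumption rather than a transcription of the model. The instruction count ratio comes from the expression for $I_o$ and the cycle ratio from Equation 4.4.7. All names are illustrative.

```python
def pipeline_speedup(f_i, por_fill, c_i, t_ratio=1.0):
    """Estimated speedup of the 4-stage pipeline over the 3-stage pipeline.

    f_i      -- LOAD frequency in the SPUR CPU program
    por_fill -- portion of LOAD delay slots filled by a useful instruction
    c_i      -- cycles per instruction in the SPUR CPU
    t_ratio  -- cycle time ratio T_o / T_i (1.0 means equal cycle times)
    """
    instr_ratio = 1.0 - f_i * (1.0 - por_fill)      # I_o / I_i
    cpi_ratio = 1.0 + f_i / (c_i * instr_ratio)     # C_o / C_i, Equation 4.4.7
    return instr_ratio * cpi_ratio * t_ratio


def breakeven_cycle_ratio(f_i, por_fill, c_i):
    """T_o / T_i at which the 4-stage pipeline loses all of its advantage."""
    return 1.0 / pipeline_speedup(f_i, por_fill, c_i)


if __name__ == "__main__":
    for c_i in (1.5, 2.0):
        gain = (pipeline_speedup(0.3, 0.8, c_i) - 1.0) * 100.0
        ratio = breakeven_cycle_ratio(0.3, 0.8, c_i)
        print(f"C_i = {c_i}: improvement {gain:.0f}%, break-even T_o/T_i {ratio:.2f}")
```

Setting the speedup to 1 with equal cycle times and solving for `por_fill` gives $1 - 1/C_i$, which agrees with Equation 4.4.9.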

4.4.2.  Extra  Pipe  Stage-Impact  on  Resources 

Metric                 Extra Temp.       Extra PC          Extra             Extra Control     Total
                       Register                            Comparators       Stage
Chip Area (mm x mm)    1.1/57   2.0%     0.2/57   0.3%     0.1/57   0.2%     0.3/57   0.5%     3%
Transistors (x 1000)   1.8/115  1.6%     0.4/115  0.3%     0.2/115  0.2%     0.4/115  0.3%     2%

Table 4-4-1 Resources Metrics for the Extra Pipe Stage

Each column lists the absolute and percentage impact of each component on the different
resources metrics. The percentage impact is calculated by dividing the absolute impact by the
total value of that metric in the SPUR CPU. The total impact of these components, that is, the
impact due to the extra pipe stage, is summarized in the rightmost column.

The resources impact of the extra pipe stage can be estimated by comparing the resources
needed to implement the SPUR CPU 4-stage pipeline to the resources needed to implement the
RISC II 3-stage pipeline. Relative to the resources needed to implement the RISC II 3-stage
pipeline, the extra pipe stage requires:

(1)  One  extra  temporary  register  (Dst2)  in  the  lower  datapath  (Figure  2-3-1). 

(2)  One  extra  program  counter  register  (MemPC)  in  the  upper  datapath  (Figure  2-3-2). 

(3) Two extra comparators for the internal forwarding logic. Internal forwarding is discussed
in Figure 2-1-6.

(4)  One  extra  stage  (MemCtrBuf)  in  the  Sequencer  of  the  Master  Control  (Figure  2-4-3). 

All these components increase the area and transistor count of the SPUR CPU. Their
impacts are summarized in Table 4-4-1. Their impacts on the control PLA's outputs, the control
PLA's product terms, and the output signal pins, however, are either none or negligible. In order
to keep the table simple, these negligible impacts are not shown.

4.4.3.  Extra  Pipe  Stage-Impact  on  Complexity 

Simulation effort cannot be used directly as a complexity metric for the extra pipe stage
because we do not have a separate set of diagnostics designated just to test the extra pipe stage.
However, every SPUR CPU diagnostic checks this extra stage implicitly. I estimate that 15% of
all the diagnostic cycles were spent in checking this stage. Furthermore, I also believe that this
extra pipe stage makes all diagnostics more complex and increases the simulation effort by one
man-month (29%). These two numbers are summarized in Table 4-4-2.

Metric                   Absolute Impact    Percentage Impact
Cycles of Diagnostics    -                  = 15%
Man-Month of Effort      1                  1/3.5 = 29%

Table 4-4-2 Complexity Metrics for the Extra Pipe Stage

The first column lists the absolute impact on the complexity metrics due to the extra pipe stage.
The second column lists the percentage impact. The percentage impact is calculated by dividing
the absolute impact by the total value of that metric in the SPUR CPU. In Row 1, I do not have
the exact value of the Absolute Impact, but I can estimate the Percentage Impact.

4.4.4.  Extra  Pipe  Stage-Impact  Summary 

The extra pipe stage's impact on performance (Figure 4-4-2), resources (Table 4-4-1), and
complexity (Table 4-4-2) can be summarized in the following inequalities:

9% < Performance Impact < 14% => Median Performance Impact = 11.5%

2%  <  Resources  Impact  <  3%  =>  Median  Resource  Impact  =  2.5% 

15%  <  Complexity  Impact  <  29%  =>  Median  Complexity  Impact  =  22% 

The 4-stage pipeline does not use up many resources, but it does increase the complexity.
Notice that the best performance improvement is estimated to be only around 14%. This
assumes that the SPUR CPU pipeline with the extra stage has the same cycle time as the
stripped down CPU's 3-stage pipeline. As we can see from Figure 4-4-2(c), this performance
improvement will disappear quickly if the cycle time of the SPUR CPU is only slightly bigger
than the cycle time of the stripped down CPU. Consequently, I do not think the SPUR CPU
4-stage pipeline is a winning feature.

Figure 4-4-3 Three Different Views of the Tradeoffs

(a) shows the pessimistic view, which uses the smallest performance impact and the biggest
resources and complexity impact. (b) shows the realistic view, which uses the median performance
impact and median resources and complexity impact. (c) shows the optimistic view,
which uses the biggest performance impact and the smallest resources and complexity impact.

In general, I think designers should be very careful whenever they add "delay" stages to the
pipeline to avoid pipeline suspension due to structural conflicts [Kog81], because the performance
improvement may be relatively small. There are two reasons for this caution:

(1) Due to data or branch hazards, adding a delay stage to the pipeline is likely to end up
creating a delay instruction, and the delay slot must be filled for the longer pipeline to be
an advantage.

(2) Adding a delay stage increases the length and thus the complexity of the pipeline without
a finer division of the original task. This may result in a longer cycle time.

4.5.  On-Chip  Instruction  Cache  Evaluation 

The design and implementation of the on-chip instruction cache are discussed in detail by
[Hil87a] and [Dun86], respectively. Sections 4.5.1, 4.5.2, and 4.5.3 discuss the on-chip
instruction cache's impact on the performance, resources, and complexity of the SPUR CPU.
Section 4.5.4 summarizes the results.

4.5.1.  On-Chip  Instruction  Cache-Impact  on  Performance 

This section evaluates the on-chip instruction cache's impact on performance by comparing
the performance of the SPUR CPU against an imaginary stripped down SPUR CPU that does not
have an internal instruction cache. I will use the subscript "i" ($T_i$, $C_i$, $M_i$) for the
SPUR CPU and the subscript "o" ($T_o$, $C_o$, $M_o$) for the stripped down CPU. The on-chip
instruction cache does not directly affect the number of instructions needed to perform any
function. However, without the on-chip instruction cache, instruction fetch and data access
cannot occur in parallel, and it becomes impossible to implement the SPUR CPU 4-stage
pipeline. Therefore the stripped down CPU must use the shorter RISC II 3-stage pipeline.
Based on the discussion in Section 4.4, we have:




$M_i = 2 - POR_{fill}$  (4.5.1)

$POR_{fill}$ = Portion of the LOAD delay slots that are filled by a useful instruction

$M_o = 1$  (4.5.2)

The stripped down CPU's average number of cycles per instruction or its cycle time must
be bigger than that of the SPUR CPU because it does not have an on-chip instruction cache
and must fetch every instruction from the slower external cache. Let us assume:

Assumption  1 

The  external  cache  is  so  big  that  the  instruction  miss  rate  is  negligible.  It  always  takes  one 
cycle  to  fetch  an  instruction  unless  instruction  fetch  is  blocked  by  data  access.  In  other 
words,  the  stripped  down  CPU  without  an  on-chip  instruction  cache  runs  slower  so  that  it 
can  fetch  and  execute  one  instruction  per  cycle  under  most  situations. 

However, no matter how slowly the stripped down CPU runs, it still cannot fetch and execute
one instruction every cycle, because its 3-stage pipeline must be suspended whenever a LOAD
or STORE instruction is executed to avoid a conflict between data access and instruction
fetch. Based on the discussion in Section 4.4 that results in Equation 4.4.5, we get the
following equation for the stripped down CPU's average number of cycles per instruction
($C_o$):

$C_o = \hat{C}_i + F_{load} + F_{store}$  (4.5.3)

$\hat{C}_i$ = $C_i$ if the SPUR CPU has a perfect on-chip instruction cache
$F_{load}$ = Frequency of LOAD instructions in the stripped down CPU
$F_{store}$ = Frequency of STORE instructions in the stripped down CPU

The term $\hat{C}_i$ is used instead of $C_i$ in Equation 4.5.3 because of Assumption 1.
Assuming the external cache miss rate is low, the term $\hat{C}_i$ can be expressed as:

$\hat{C}_i = C_i - (1 - HIT_{icache}) \times PEN_{icache}$  (4.5.4)

$HIT_{icache}$ = Hit rate of the SPUR CPU on-chip I-cache
$PEN_{icache}$ = Miss penalty of the SPUR CPU on-chip I-cache

In order to simplify the derivation, I will write the frequency of the STORE instruction as a
function of the frequency of the LOAD instruction:

$F_{store} = p \times F_{load}$  (4.5.5)

Applying Equation 4.5.4 and Equation 4.5.5 to Equation 4.5.3, we have:

$C_o = C_i + (1 + p) \times F_{load} - (1 - HIT_{icache}) \times PEN_{icache}$  (4.5.6)

As shown previously in Equation 4.4.6, the frequency of LOAD in programs written for the
stripped down CPU ($F_{load}$) can be expressed in terms of the frequency of LOAD in programs
written for the SPUR CPU ($F_i$). Consequently, Equation 4.5.6 can be written as:

$\frac{C_o}{C_i} = 1 + \frac{(1 + p) \times F_i}{C_i \times (1 - F_i \times (1 - POR_{fill}))} - \frac{(1 - HIT_{icache}) \times PEN_{icache}}{C_i}$  (4.5.7)
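
To make the behavior of Equation 4.5.7 easier to explore, the following is a minimal Python
sketch of the ratio of average cycles per instruction. The function name `cpi_ratio` and its
parameter names are illustrative, and the value used for the SPUR CPU's cycles per
instruction (`C_I = 1.5`) is a hypothetical placeholder, not a figure taken from this thesis.

```python
def cpi_ratio(f_i, hit_rate, miss_penalty, c_i, p=0.7, por_fill=0.7):
    """Return C_o / C_i following the form of Equation 4.5.7.

    f_i          -- frequency of LOAD in programs written for the SPUR CPU
    hit_rate     -- hit rate of the on-chip instruction cache (0.0 to 1.0)
    miss_penalty -- on-chip instruction cache miss penalty, in cycles
    c_i          -- SPUR CPU average cycles per instruction (assumed input)
    p            -- STORE frequency as a fraction of LOAD frequency
    por_fill     -- portion of LOAD delay slots filled by a useful instruction
    """
    # Extra cycles the stripped down CPU spends suspending its pipeline
    # on every LOAD and STORE.
    stall_term = (1 + p) * f_i / (c_i * (1 - f_i * (1 - por_fill)))
    # Cycles the SPUR CPU spends on on-chip instruction cache misses.
    miss_term = (1 - hit_rate) * miss_penalty / c_i
    return 1 + stall_term - miss_term


if __name__ == "__main__":
    C_I = 1.5  # hypothetical value, chosen only for illustration
    for hit in (1.0, 0.9, 0.75):
        print(hit, round(cpi_ratio(0.3, hit, 2.0, C_I), 3))
```

With a perfect hit rate the miss term vanishes, so the ratio does not depend on the miss
penalty; as the hit rate falls, the ratio shrinks and can drop below one.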


Figure 4-5-1 Performance Improvement due to the On-Chip Instruction Cache

(a) and (b) show the performance improvement as a function of the instruction cache hit rate
($HIT_{icache}$) and the frequency of LOAD ($F_i$). I have assumed the on-chip instruction
cache miss penalty to be 2.0 for (a) and 2.5 for (b). (a) and (b) assume the on-chip
instruction cache has no effect on the cycle time. The effect of a reduction of the cycle time
($T_i < T_o$) due to the on-chip instruction cache is shown in (c).

The performance improvement of the SPUR CPU over the stripped down CPU that does not
have an on-chip instruction cache is plotted in Figure 4-5-1 as a function of the I-cache hit
rate ($HIT_{icache}$), the frequency of LOAD ($F_i$), and the cycle time ratio ($T_o / T_i$).
This is the result of applying Equations 4.5.1, 4.5.2, and 4.5.7 to the performance model
(Equation 4.1.6). In order to reduce the number of variables in the graph, I have assumed:


(1) 70% of the LOAD delay slots are filled by useful instructions ($POR_{fill}$ = 0.7) for
Equation 4.5.1.

(2) The frequency of STORE is 70% of the LOAD frequency (p = 0.7) for Equation 4.5.7.

Each line in Figure 4-5-1(a) and (b) assumes a fixed frequency of LOAD (20% and 30%) and
a fixed instruction cache miss penalty: 2.0 in (a) and 2.5 in (b). The lower limit of the
I-cache miss penalty is two cycles. This limit (PENALTY = 2.0) can only be achieved if the
external cache can supply the missing instruction within a cycle. Therefore in Figure
4-5-1(b), where the I-cache miss penalty is two and a half cycles (PENALTY = 2.5), the extra
half cycle can be considered as the miss penalty of the SPUR CPU external cache. Figure
4-5-1(b) is thus biased against the SPUR CPU (worst case) because it assumes the SPUR CPU has
to pay an external cache miss penalty while this penalty is assumed to be negligible for the
stripped down CPU (see Assumption 1). Notice that Figure 4-5-1(a) and (b) predict the same
amount of performance improvement when $HIT_{icache}$ = 1 because neither case has to pay the
I-cache miss penalty when the hit rate is 100%.

Assumption 1 essentially states that, all things being equal, the SPUR CPU's average number
of cycles per instruction will always be higher than that of the stripped down CPU unless the
SPUR CPU has a perfect instruction cache (100% hit rate). Therefore, Figure 4-5-1(a) and (b),
which neglect the on-chip instruction cache's effect on cycle time, should predict the
performance improvement to be negative for all cases where $HIT_{icache}$ < 1. This is not
the case because "all things are not equal" for the SPUR CPU and the stripped down CPU.
Besides reducing the cycle time, the internal instruction cache also enables the SPUR CPU to
execute LOAD and STORE without suspending the pipeline. This is the only advantage of the
on-chip instruction cache when its effect on cycle time is neglected. However, this advantage
disappears quickly as the hit rate ($HIT_{icache}$) decreases.


Figure 4-5-1(c) shows the on-chip instruction cache's impact on performance when the
cache's effect on cycle time is taken into account. We believe the on-chip instruction cache
improves the SPUR CPU cycle time by 50% ($T_o / T_i = 1.5$). Mark Hill estimated the
instruction cache hit rate to be 75% [Hil87b]. George Taylor [Tay86] estimated the frequency
of LOAD to be between 20% and 30%. Based on these numbers, we have:


Best Case

30% LOAD, 0.7 X 30% = 21% STORE, two cycles miss penalty.

Worst Case

20% LOAD, 0.7 X 20% = 14% STORE, two and a half cycles miss penalty.

As shown in Figure 4-5-1(c), the performance improvements for these two cases are 41% and
19%, respectively, if the on-chip instruction cache can improve the cycle time by 50%.


4.5.2.  On-Chip  Instruction  Cache-Impact  on  Resources

The on-chip instruction cache in the SPUR CPU is organized into an Instruction Unit
(Section 2.2) that can be divided into two parts: (1) datapath, and (2) controller. The
resources impact of the on-chip instruction cache can be estimated by counting the area and
transistors consumed by the Instruction Unit, as shown in Table 4-5-1. Table 4-5-1 has three
columns: Column 1 shows resources consumed by the datapath of the Instruction Unit, Column 2
shows resources consumed by the controller of the Instruction Unit, and Column 3 shows the
total resources consumed by the Instruction Unit.

                   I-Unit             I-Unit
                   Datapath           Controller         Total

Chip Area          16.7/57   29%      2/57     4%        33%
(mm X mm)

Transistors        36.4/115  32%      1.2/115  1%        33%
(x 1000)

Table  4-5-1  Resources  Metrics  for  the  On-Chip  Instruction  Cache 

Each column lists the absolute and percentage impact of each component on the different
resources metrics. The percentage impact is calculated by dividing the absolute impact by the
total value of that metric in the SPUR CPU. The total impact of the datapath and the
controller is the impact due to the on-chip instruction cache and is summarized in the
rightmost column.


The datapath of the Instruction Unit, which consists of the cache and tag arrays, consumes
32% of the transistors but only 29% of the chip area. This is a result of the regularity
of the cache and tag arrays. On the other hand, the controller of the instruction cache is
not as regular. Consequently, it consumes 4% of the chip area although it only represents 1%
of the transistors. Neither the controller nor the datapath of the Instruction Unit has any
significant impact on the PLA's outputs or product terms. This indicates the Instruction Unit
is relatively independent from the Execution Unit. In order to keep Table 4-5-1 simple, these
negligible impacts are not shown.

4.5.3.  On-Chip  Instruction  Cache-Impact  on  Complexity 

The  complexity  due  to  the  on-chip  instruction  cache  can  be  quantified  by  the  simulation 
effort.  The  on-chip  instruction  cache  of  the  SPUR  CPU  is  simulated  at  both  the  behavioral  and 
switch  level.  The  diagnostics  we  used  for  switch  level  simulation  are: 

(1) cc-IB_disabled - This diagnostic takes 364 cycles to verify that the SPUR CPU can at
least run with the Instruction Unit disabled.

(2) cc-IB_fetch - This diagnostic takes 356 cycles to verify that the SPUR CPU can run with
the Instruction Unit enabled but prefetching disabled.

(3) cc-IB_fetch - This diagnostic takes 261 cycles to verify that the SPUR CPU can run with
the Instruction Unit and prefetching enabled.

(4) cc-IB_stuck - This diagnostic takes 4013 cycles to verify that there are no "stuck-at"
errors in the Instruction Unit.

The total number of cycles and the man-months of simulation effort are two complexity
metrics we can extract from this set of diagnostics. Column 1 of Table 4-5-2 is the absolute
value of these two metrics. The percentage impact of each metric is calculated in Column 2.
Notice that the increase in complexity due to the Instruction Unit is relatively small.

                   Absolute Impact    Percentage Impact

Cycles of          4,994              4,994/55,516 = 9%
Diagnostics

Man-Month          0.25               0.25/3.5 = 7%
of Effort

Table 4-5-2 Complexity Metrics for the On-Chip Instruction Cache

The first column lists the absolute impact on the complexity metrics due to the on-chip
instruction cache. The second column lists the percentage impact. The percentage impact is
calculated by dividing the absolute impact by the total value of that metric in the SPUR CPU.
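
The arithmetic behind the two complexity metrics can be checked with a short Python sketch.
The cycle counts, the 55,516-cycle total, and the 3.5 man-month total are the figures quoted
in this section; the variable and function names are illustrative only.

```python
# (diagnostic name, switch-level simulation cycles) as listed in Section 4.5.3.
# A list of pairs is used because two of the diagnostics share the same name.
ICACHE_DIAGNOSTICS = [
    ("cc-IB_disabled", 364),
    ("cc-IB_fetch", 356),
    ("cc-IB_fetch", 261),
    ("cc-IB_stuck", 4013),
]

TOTAL_DIAGNOSTIC_CYCLES = 55516  # all SPUR CPU diagnostics
TOTAL_MAN_MONTHS = 3.5           # total simulation effort


def percentage_impact(absolute, total):
    """Percentage impact = absolute impact / total value of the metric."""
    return 100.0 * absolute / total


icache_cycles = sum(cycles for _, cycles in ICACHE_DIAGNOSTICS)
cycle_impact = percentage_impact(icache_cycles, TOTAL_DIAGNOSTIC_CYCLES)
effort_impact = percentage_impact(0.25, TOTAL_MAN_MONTHS)

print(icache_cycles, round(cycle_impact), round(effort_impact))
```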

4.5.4.  On-Chip  Instruction  Cache-Impact  Summary 

The impact of the on-chip instruction cache on performance, resources allocation, and
complexity is illustrated in Figure 4-5-1, Table 4-5-1, and Table 4-5-2, respectively. The
results of this figure and these tables can be summarized as:

19% ≤ Performance Impact ≤ 41% => Median Performance Impact = 30%

33% ≤ Resources Impact ≤ 33% => Median Resources Impact = 33%

7% ≤ Complexity Impact ≤ 9% => Median Complexity Impact = 8%

[Figure 4-5-2: three panels, (a) Pessimistic, (b) Median, (c) Optimistic, each plotting
Performance, Resources, and Complexity.]

Figure 4-5-2 Three Different Views of the Tradeoffs

(a) shows the pessimistic view which uses the smallest performance impact and the biggest
resources and complexity impact, (b) shows the median view which uses the median
performance impact and median resources and complexity impact, (c) shows the optimistic view
which uses the biggest performance impact and the smallest resources and complexity impact.

These  ranges  of  impact  numbers  can  be  used  to  formulate  the  optimistic,  median,  and  pes¬ 
simistic  views  of  the  performance,  resources,  and  complexity  tradeoffs.  These  three  views  are 
shown  graphically  in  Figure  4-5-2.  Notice  that  although  the  Instruction  Unit  consumes  a  large 
amount  of  resources,  its  impact  on  complexity  is  relatively  small. 
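
How the three views are assembled from the impact ranges can be sketched in a few lines of
Python. This is an illustration only: the function names are hypothetical, and the "median"
is taken to be the midpoint of each range, which is an assumption inferred from the numbers
quoted in this chapter (for example, 19% and 41% giving 30%). The complexity range uses the
7% and 9% figures from Table 4-5-2.

```python
def midpoint(bounds):
    """Midpoint of a (low, high) range, used here as the 'median' impact."""
    low, high = bounds
    return (low + high) / 2


def three_views(performance, resources, complexity):
    """Build the pessimistic, median, and optimistic views.

    Each argument is a (low, high) pair of percentage impacts.
    Pessimistic: smallest performance, biggest resources and complexity.
    Optimistic:  biggest performance, smallest resources and complexity.
    """
    return {
        "pessimistic": (performance[0], resources[1], complexity[1]),
        "median": (midpoint(performance), midpoint(resources), midpoint(complexity)),
        "optimistic": (performance[1], resources[0], complexity[0]),
    }


# Ranges for the on-chip instruction cache, in percent.
views = three_views(performance=(19, 41), resources=(33, 33), complexity=(7, 9))
for name, (perf, res, cplx) in views.items():
    print(f"{name}: performance +{perf}%, resources +{res}%, complexity +{cplx}%")
```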

4.6.  Multiprocessing  Support  Evaluation 

The SPUR CPU supports multiprocessing by communicating with the Cache Controller
Chip [Woo86] via a specialized coprocessor interface [WEG87]. The performance model
developed in Section 4.1 cannot be used here because it was developed for uniprocessor
performance analysis only. Uniprocessor performance is a function of cycle time, instruction
count, and average number of cycles per instruction. These factors are not as significant in
a multiprocessor environment where many processors work in parallel. Consequently,
multiprocessor performance depends more on the number of processors [Kat85], bus traffic
[Gib87], and cache behavior [EgK88]. Since all these factors are outside the scope of this
thesis, the multiprocessing support's impact on performance will not be studied here.
Instead, we will concentrate on the multiprocessing support's impact on resources in Section
4.6.1 and on complexity in Section 4.6.2. The results are summarized in Section 4.6.3.


4.6.1.  Multiprocessing  Support-Impact  on  Resources 


                   Instructions       Cache Controller   Total
                                      Interfaces

Control PLA        6/54    11%        0/54    0%         11%
Outputs

Control PLA        2/84    2%         0/84    0%         2%
Products

Chip Area          0/57    0%         2.2/57  4%         4%
(mm X mm)

Transistors        0/115   0%         0.8/115 1%         1%
(x 1000)

Number of          0/156   0%         15/156  10%        10%
Signal Pins

Table  4-6-1  Resources  Metrics  for  Multiprocessing  Support 

Each  column  lists  the  absolute  and  percentage  impact  of  each  sub-feature  on  the  different 
resources  metrics.  The  percentage  impact  is  calculated  by  dividing  the  absolute  impact  by  the  to¬ 
tal  value  of  that  metric  in  the  SPUR  CPU.  The  total  impact  of  these  sub-features,  that  is  the  im¬ 
pact  due  to  multiprocessing  support,  is  summarized  in  the  right  most  column. 


The multiprocessing support requires seven load instructions, three store instructions, and a
Cache Controller Interface. Although all load or store instructions are alike internally, the
CPU must request different cache operations for different load or store instructions via the
cache controller interface. The impact of multiprocessing support on resources is quantified
by several resources metrics in Table 4-6-1. The instructions' impact is mainly on the number
of control PLA outputs. On the other hand, the Cache Controller Interface's major impact is
on the number of signal pins. The number of transistors and the active chip area consumed by
the multiprocessing support are relatively small.

4.6.2.  Multiprocessing  Support-Impact  on  Complexity

The complexity due to the multiprocessing support can be quantified by the simulation
effort. The cooperation of the SPUR CPU and the Cache Controller is simulated at both the
behavioral and switch level. The diagnostics we used for switch level simulation are:

(1) cc-epromLd32 - This diagnostic takes 260 cycles to verify that the SPUR CPU can work
together with the Cache Controller to load 32-bit data from the external world.

(2) cc-ldSt40 - This diagnostic takes 1184 cycles to verify that the SPUR CPU can work
together with the Cache Controller to load and store 40-bit data from and to the external
world.

(3) cc-short_hit - This diagnostic takes 2131 cycles to verify that the SPUR CPU can work
together with the Cache Controller to handle a cache hit situation.

(4) cc-shortMissPF - This diagnostic takes 1392 cycles to verify that the SPUR CPU can
work together with the Cache Controller to handle a cache miss that results in a page
fault.

(5) cc-ptetMissPF - This diagnostic takes 1585 cycles to verify that the SPUR CPU can
work together with the Cache Controller to handle a cache miss that results in a page fault
and the page table entry is not in the cache.

(6) cc-short_si - This diagnostic takes 1805 cycles to verify that the SPUR CPU can work
together with the Cache Controller to handle an interrupt.

(7) catch-fault - This diagnostic takes 5518 cycles to verify that the SPUR CPU can work
together with the Cache Controller to handle page faults that are caused by store operations.

The total number of cycles and the man-months of simulation effort are two complexity
metrics we can extract from this set of diagnostics. The metrics are summarized in Table
4-6-2. Notice that the increase in complexity due to multiprocessing support is relatively
big.

                   Absolute Impact    Percentage Impact

Cycles of          13,875             13,875/55,516 = 25%
Diagnostics

Man-Month          1.0                1.0/3.5 = 29%
of Effort

Table 4-6-2 Complexity Metrics for Multiprocessing Support

The first column lists the absolute impact on the complexity metrics due to multiprocessing
support. The second column lists the percentage impact. The percentage impact is calculated
by dividing the absolute impact by the total value of that metric in the SPUR CPU.

4.6.3.  Multiprocessing  Support-Impact  Summary 

The impact of the multiprocessing support on resources allocation and complexity is
illustrated in Table 4-6-1 and Table 4-6-2, respectively. The results of these tables can
be summarized as:


1% ≤ Resources Impact ≤ 11% => Median Resources Impact = 6%

25% ≤ Complexity Impact ≤ 29% => Median Complexity Impact = 27%

[Figure 4-6-1: three panels, (a) Pessimistic, (b) Median, (c) Optimistic, each plotting
Resources and Complexity.]

Figure 4-6-1 Three Different Views of the Tradeoffs

(a) shows the pessimistic view which causes the biggest resources and complexity impact, (b)
shows the median view which causes the median resources and complexity impact, (c) shows the
optimistic view which causes the smallest resources and complexity impact.

These ranges of impact numbers can be used to formulate the optimistic, median, and
pessimistic views of the resources and complexity impact. These three views are shown
graphically in Figure 4-6-1. Notice that while the multiprocessing support does not consume a
large amount of resources, its impact on complexity is large.

4.7.  Microarchitectural  Evaluation  Summary 

The performance model introduced in Section 4.1 allows us to study performance
quantitatively. Section 4.7.1 summarizes how this simple model was used to study the
performance improvement caused by different SPUR CPU features. Section 4.7.2 gives a
quantitative argument for keeping the cycle time and average number of cycles per instruction
low. Finally, Section 4.7.3 discusses a systematic approach to the performance, resources,
and complexity tradeoffs.

4.7.1.  Versatility  of  the  Performance  Model 

The performance model introduced in Section 4.1 allows us to study different
microarchitectural features' impact on performance by comparing the performance of the
advanced microarchitecture with that feature against a stripped down microarchitecture
without that feature. This performance model has only five parameters:

(1) $M_i$: The number of instructions it takes to perform a certain function with the
architectural feature under consideration.

(2) $M_o$: The number of instructions it takes to perform a certain function without the
architectural feature under consideration.

(3) $F_i$: The frequency of the architectural feature being used.

(4) $T_o / T_i$: The cycle time ratio.

(5) $C_o / C_i$: The average number of cycles per instruction ratio.

This simple performance model is versatile enough to let us evaluate the performance
impact of different SPUR CPU features although they affect the performance very differently.
All we have to do is find the proper values or, in more complex cases, find the proper
expressions for the five parameters. For example, in Section 4.4 the number of instructions
it takes to execute a LOAD ($M_i$) is expressed in terms of the proportion of the LOAD delay
slots being filled by useful instructions ($POR_{fill}$). Below is a summary of how the
different features of the SPUR CPU affect these five parameters. In all cases, I have used the
subscript "i" for the SPUR CPU ($T_i$, $C_i$, $M_i$) and the subscript "o" for the stripped
down CPU ($T_o$, $C_o$, $M_o$).


LISP Support:

$M_i = 1$. $M_o > 1$ and depends on the number of instructions it takes to do the explicit
tag checking. $F_i$ is the frequency of the instructions that require tag checking. Finally,
$C_o / C_i$ and $T_o / T_i$ are not affected directly.

FPU Support:

$M_i = 1$. $M_o > 1$ and depends on the number of instructions it takes to emulate the
floating point operations. $F_i$ is the frequency of the floating point operations.
$C_o / C_i < 1$ because even with the FPU coprocessor, the number of cycles it takes the
CPU-FPU combination to execute a floating point operation is still bigger than the average
number of cycles per CPU instruction. Finally, $T_o / T_i$ is not affected directly.

Extra Pipeline Stage:

$M_o = 1$. $M_i > 1$ and depends on the proportion of the LOAD delay slots being filled by
useful instructions. $F_i$ is the frequency of the LOAD. $C_o / C_i > 1$ because the stripped
down CPU must suspend the pipeline for one cycle whenever a LOAD is executed. Finally,
$T_o / T_i$ is not affected directly.

On-Chip Instruction Cache:

$M_o = 1$. $M_i > 1$ and depends on the proportion of the LOAD delay slots being filled by
useful instructions. $F_i$ is the frequency of the LOAD and STORE. $C_o / C_i > 1$ because
the stripped down CPU must suspend the pipeline for one cycle whenever a LOAD or STORE is
executed. Finally, $T_o / T_i > 1$ because the on-chip instruction cache eliminates the need
for going off-chip to fetch every instruction.

I must point out that the performance improvements due to LISP and FPU support are very
program dependent. The FPU support will not benefit any program that does not contain any
floating point operations. Similarly, the LISP support feature will not benefit any program
that is not written in LISP.

4.7.2.  The  Need  for  Speed 

Figure 4-5-1(c) shows that the performance improvement predicted by the best argument
for having an on-chip instruction cache increases by 47% (from -6% to 41%) while the worst
argument increases by only 40% (from -21% to 19%) when the cycle time is improved by 50%. In
general, the best argument for having a particular feature improves faster than the worst
argument when the cycle time, the average number of cycles per instruction, or both are
improved by that feature. From the opposite viewpoint, the best argument for having a
particular feature degrades faster than the worst argument when the cycle time, the average
number of cycles per instruction, or both are degraded by that feature. This is illustrated
in Figure 4-2-1(b), Figure 4-3-1(c), and Figure 4-4-2(c). For example, in Figure 4-2-1(b)
when the p factor (the product of the cycle time ratio and the average number of cycles per
instruction ratio) is reduced from 1.0 to 0.9, the best argument for having tag checking
drops 30% (from 204% to 174%), the median argument drops only 19% (from 92% to 73%), while
the worst argument drops even less, only 13% (from 24% to 11%). As illustrated in Figure
4-7-1, this makes sense mathematically. However, as engineers, we were taught not to believe
the mathematics unless it makes sense physically!

Figure 4-7-1 The Effect of Degrading the p Factor

This is a simplified version of Figure 4-2-1(b) which shows how the performance improvement
due to hardware tag checking is affected by degrading cycle time or average number of cycles
per instruction or both. The p factor is defined as the product of the cycle time ratio and
the average number of cycles per instruction ratio. Notice that as p decreases, the
performance improvement predicted by the best argument (top line) drops off faster than the
median argument (middle), which drops faster than the worst argument (bottom). This makes
sense mathematically because all these lines must merge at minus 100% when p = 0.

In order to understand why the top line drops faster than the bottom line (Figure 4-7-1), we
have to look at the case p = 1. When p = 1, we are ignoring the effects of the new feature on
cycle time and the average number of cycles per instruction. A big performance improvement at
p = 1 (top line) means adding that feature can reduce the number of instructions (1) by a
large amount. Seen another way, it means getting rid of the feature will increase the number
of instructions by a large amount. In our LISP example, Figure 4-7-1(b), the best argument
for having tag checking (top line) therefore predicts an increase of 204 instructions for
every 100 instructions if tag checking is removed. The worst argument predicts an increase of
only 24 extra instructions for every 100 instructions.

At p = 1, the gap between the best and worst arguments represents this difference in the
number of instructions (204 - 24 = 180) because we are neglecting hardware tag checking's
effect on cycle time and average number of cycles per instruction. On the other hand, at
p < 1, the gap between the best and worst arguments represents the time the stripped down CPU
takes to execute the extra 180 instructions. If by getting rid of hardware tag checking we
can reduce the cycle time or the average number of cycles per instruction or both (smaller
p), the time it takes the stripped down CPU to execute the extra instructions is reduced.
Consequently, as we move towards the p = 0 point, the gap between the best and worst
arguments gets smaller and smaller. Finally, at p = 0, the stripped down CPU's cycle time and
average number of cycles per instruction are so much faster than the SPUR CPU's that the time
to execute the extra instructions is negligible: the gap disappears. As a matter of fact, at
p = 0, the time the stripped down CPU takes to execute the benchmark is practically zero
compared to the execution time of the SPUR CPU. The performance improvement is therefore
-100%.
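
The numbers quoted in this section are consistent with a simple scaling relation: if a
feature gives an improvement I at p = 1, the improvement at another value of p is
(1 + I) x p - 1. The Python sketch below uses that relation. It is inferred from the quoted
figures and from the requirement that all lines meet at -100% when p = 0; it is not a
restatement of the performance model of Section 4.1, and the function name is illustrative.

```python
def scaled_improvement(improvement_at_p1, p):
    """Performance improvement at a given p factor.

    improvement_at_p1 -- improvement when p = 1, as a fraction (2.04 means 204%)
    p                 -- product of the cycle time ratio and the CPI ratio
    """
    return (1.0 + improvement_at_p1) * p - 1.0


if __name__ == "__main__":
    # Hardware tag checking: best, median, and worst arguments at p = 1.
    for label, base in (("best", 2.04), ("median", 0.92), ("worst", 0.24)):
        at_09 = scaled_improvement(base, 0.9)
        print(f"{label}: {base:.0%} at p=1.0, {at_09:.1%} at p=0.9")

    # On-chip instruction cache with a 50% better cycle time (factor of 1.5).
    for label, base in (("best", -0.06), ("worst", -0.21)):
        print(f"{label}: {scaled_improvement(base, 1.5):.1%}")
```

The gap between two arguments at a given p is p times their gap at p = 1, which is why the
gap shrinks linearly and vanishes at p = 0.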

In my opinion, this is a good quantitative argument for keeping the cycle time and average
number of cycles per instruction low at all costs, because they benefit all instructions, not
just one particular instruction. However, I must also point out that when I say reduce the
cycle time, it does not mean just reduce the CPU cycle time. The environment surrounding the
CPU, such as the memory system and the I/O devices, must also speed up. Otherwise the extra
wait states will lower the performance improvement. This, of course, is one version of
Amdahl's law [Amd67], which states that the speed of any computation is limited by its
slowest part.

4.7.3.  Performance  Resources  and  Complexity  Tradeoffs 

Resources and complexity are two separate ways we can pay for performance. As we can
see from our analysis, resources and complexity are quite independent. For example, the
on-chip instruction cache has a large impact on resources but only a small impact on
complexity. On the other hand, the multiprocessing support has a large impact on complexity
but only a small impact on resources. The microarchitect has many options to achieve the
desired performance by selecting different features for the microarchitecture. All options
involve tradeoffs between performance, resources, and complexity.

Figure 4-7-2 Performance Resources and Complexity Tradeoffs

The options, which correspond to different features that can be included in the
microarchitecture, are placed in increasing complexity on the vertical axis. The performance
and resources needed for these options are plotted on the horizontal axes. Each module has
its own minimum performance requirement which is a direct result of the overall performance
goal. This performance requirement places an "acceptable performance" bound on the
Performance axis.

This type of tradeoff is shown graphically in Figure 4-7-2. Since we are considering which
features to include in the CPU, Option 1 could be a basic CPU that has minimum features. It
is simple in complexity, low cost in resources, and low in performance. Option 2 could be a CPU
similar to Option 1 but the microarchitect uses resources to pay for more performance by adding
a very large instruction cache. This large instruction cache increases the resources a lot but only
increases the complexity slightly. Option 3 could be a CPU similar to Option 1 but the
microarchitect uses complexity to pay for more performance by using a very long pipeline. This long
pipeline increases the complexity a lot but only increases the resources slightly. Option 4 could
be a combination of Option 2 and Option 3. The microarchitect uses both resources and
complexity to pay for more performance by using a moderate size instruction cache and a moderate length
pipeline. The interaction between the pipeline and the instruction cache further increases the
complexity. However, since the cache is much smaller than the cache in Option 2, the resources it
consumes are fewer. Finally, Option N could be a machine that is so complex that its performance is
less than that of a simple machine that uses far fewer resources.

The performance requirement for the CPU places an "acceptable performance" bound on
the performance axis (Figure 4-7-2). Given this performance requirement, the microarchitect
must include enough features in the microarchitecture that the performance requirement is
met while staying within the resources and complexity constraints. Here is the
recommended approach:

(1) Make an educated guess on how many resources you are willing to spend or are available
to the CPU. This places a "resource available" bound on the resources axis in Figure 4-7-2.

(2)  Within  this  bound,  pick  the  simplest  option  available. 

(3)  If  this  option’s  performance  is  within  the  acceptable  range,  then  mission  accomplished. 
Otherwise,  go  to  Step  4. 

(4)  If  there  are  any  other  options  within  the  resource  bound,  pick  the  next  more  complex 
option  and  go  back  to  Step  3.  Otherwise  go  to  Step  5. 

(5)  If  possible,  go  to  Step  1  and  increase  the  resources  available  bound.  Otherwise,  you  may 
need  to  reduce  the  performance  expectation. 

Using Figure 4-7-2 as an example, Step 2 of this procedure will pick Option 1. However, in
Step 3, we will find out Option 1’s performance is below the acceptable range. In Step 4, Option
2 will not be considered because it is beyond the resources available bound. Option 3 will be
chosen because it is less complex than Option 4. Unfortunately, in Step 3, we will again find out
Option 3’s performance is not acceptable. This will lead us back to Step 4 and select Option 4.
This time, when we get back to Step 3, we will find out Option 4’s performance is acceptable.
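
The five steps above can be written down as a short search procedure. The following Python sketch
is one possible reading of it, not a tool that existed in the SPUR project. The `Option` record,
the function name `select_option`, and all numeric values are hypothetical; the numbers were
chosen only so that the options have the same relative positions as in the walk-through of
Figure 4-7-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    complexity: float   # relative design complexity (hypothetical units)
    resources: float    # relative resource cost (hypothetical units)
    performance: float  # relative performance (hypothetical units)


def select_option(options, required_performance, resource_bounds):
    """Return (option, bound) for the simplest acceptable option.

    `resource_bounds` lists the resource budgets to try, smallest first
    (Step 1, revisited by Step 5). Returns (None, None) when no budget
    yields an acceptable option, i.e. the performance expectation must drop.
    """
    for bound in resource_bounds:                                    # Steps 1 and 5
        affordable = [o for o in options if o.resources <= bound]
        for option in sorted(affordable, key=lambda o: o.complexity):  # Steps 2 and 4
            if option.performance >= required_performance:            # Step 3
                return option, bound
    return None, None


# Hypothetical values that mimic the relative positions in Figure 4-7-2.
OPTIONS = [
    Option("Option 1", complexity=1, resources=1, performance=1.0),
    Option("Option 2", complexity=2, resources=8, performance=2.5),
    Option("Option 3", complexity=4, resources=2, performance=1.8),
    Option("Option 4", complexity=5, resources=4, performance=2.4),
]

choice, bound = select_option(OPTIONS, required_performance=2.0, resource_bounds=[5, 10])
print(choice.name, "selected within resource bound", bound)
```

With a first resource bound of 5, the search rejects Option 1 and Option 3 on performance, never
considers Option 2, and settles on Option 4, as in the walk-through above.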

This kind of tradeoff should always be in a microarchitect’s mind, and I see absolutely no
reason why we can build CAD tools to place and route millions of gates but cannot build tools to
help designers make tradeoffs in this fashion. In this chapter, the performance, resources, and
complexity tradeoffs are done on the complete CPU. In Chapter 5, I will show how these tradeoffs
can be extended into lower level modules that form the CPU. There are two things I want to
point out:

(1) We always pick the simplest option because a complex option requires the most expensive
resource: the human designer. The simplest solution can also make use of the newest
technology. When technology is improving fast, it can negate the performance advantage
of a complex solution that takes longer to implement and debug.

(2) We do not pick the highest performance option because any option within the performance
specification is as acceptable as the highest performance option. The reason is that
the increase in CPU performance alone will not improve the system performance drastically
unless you speed up all components of the system. This is another version of
Amdahl’s law [Amd67].

In this chapter, we performed the performance, resources, and complexity tradeoff analysis
after the SPUR CPU was built. This is educational, but may be too late unless we are willing to
build multiple prototypes to correct our mistakes. In the next chapter, I will show a systematic approach
that will perform this type of analysis earlier in the design process.




4.8.  REFERENCES 


[Amd67] G. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale
Computing Capabilities”, Proceedings AFIPS 1967 Spring Joint Computer
Conference 30, Atlantic City, New Jersey, April, 1967.

[Bos88] B. K. Bose, VLSI Design Techniques for Floating-Point Computation, Doctoral
Dissertation, Computer Science Division, EECS Department, University of
California, Berkeley, November 1988.

[Dun86] R. R. Duncombe, The SPUR Instruction Unit: An On-Chip Instruction Cache
Memory for a High Performance VLSI Multiprocessor, Master Report, EECS
Department, University of California, Berkeley, CA 94720, August, 1986.

[EgK88] S. Eggers and R. Katz, “A Characterization of Sharing in Parallel Programs and its
Application to Coherency Protocol Evaluation”, The 15th Annual International
Symposium on Computer Architecture, Honolulu, Hawaii, May 30-June 2, 1988.

[Gib87] G. Gibson, “Estimating Performance of Single Bus, Shared Memory
Multiprocessors”, Report No. UCB/Computer Science Dpt. 87/355, Computer
Science Division, EECS Department, University of California, Berkeley, May 1987.

[HaK86] P. Hansen and S. Kong, “SPUR Coprocessor Interface Description”, Report No.
UCB/Computer Science Dpt. 87/308, Computer Science Division, EECS
Department, University of California, Berkeley, October 1986.

[Han88] P. M. Hansen, Coprocessor Architectures for VLSI, Doctoral Dissertation, Computer
Science Division, EECS Department, University of California, Berkeley, November
1988.

[Hen85] J. Hennessy, “VLSI RISC Processor”, VLSI Systems Design VI, 10 (October 1985).

[Hil87a] M. D. Hill, Aspects of Cache Memory and Instruction Buffer Performance, Doctoral
Dissertation, Computer Science Division, EECS Department, University of
California, Berkeley, Fall 1987.

[Hil87b] M. Hill, Private Communication, Computer Science Division, EECS Department,
University of California, Berkeley, CA 94720, December, 1987.

[Hil88] M. Hill, “A Case for Direct-Mapped Caches”, Computer 21, 12 (December 1988).

[Kat85] R. Katz, et al., “Memory Hierarchy Aspects of a Multiprocessor RISC: Cache and
Bus Analyses”, Report No. UCB/Computer Science Dpt. 85/221, Computer Science
Division, EECS Department, University of California, Berkeley, January 1985.

[Kog81] P. M. Kogge, The Architecture of Pipelined Computers, McGraw-Hill Book
Company, 1981.

[Pat89] D. Patterson, Private Communication, Computer Science Division, EECS
Department, University of California, Berkeley, CA 94720, January, 1989.

[Sip82] T. N. Sippel, Floating RISC: Implementation and Analysis of Floating Point on
RISC I, Master Report, Computer Science Division, EECS Department, University of
California, Berkeley, CA 94720, August, 1982.

[StH88] P. Steenkiste and J. Hennessy, “Lisp on a Reduced-Instruction-Set Processor:
Characterization and Optimization”, Computer 21, 7 (July 1988).

[Tay89] G. Taylor, Private Communication, Computer Science Division, EECS Department,
University of California, Berkeley, CA 94720, January, 1989.

[Tay86] G. Taylor et al., “Evaluation of the SPUR Lisp Architecture”, The 13th Annual
International Symposium on Computer Architecture, Tokyo, Japan, June 2-5, 1986.




[WEG87] D. Wood, S. Eggers and G. Gibson, “SPUR Memory System Architecture”, Report
No. UCB/Computer Science Dpt. 87/394, Computer Science Division, EECS
Department, University of California, Berkeley, December 1987.

[Woo86] D. A. Wood et al., “An In-Cache Address Translation Mechanism”, The 13th
Annual International Symposium on Computer Architecture, Tokyo, Japan, June 2-5,
1986.

[Zor89] B. Zorn, Private Communication, Computer Science Division, EECS Department,
University of California, Berkeley, CA 94720, January, 1989.




Chapter  5 

A  SYSTEMATIC  APPROACH  TO 
MICROARCHITECTURAL  DESIGN 

I shall never believe that God plays dice with the world.

Albert  Einstein,  1947 


The goal of this thesis, as stated in Chapter 1, is to provide a quantitative way to evaluate
microarchitectures and a systematic way to design them. In Chapter 4, I used the SPUR
CPU as an example to show how a microarchitecture can be evaluated quantitatively. In this
chapter, I show how to approach the microarchitectural design problem systematically.

5.1.  The  Microarchitectural  Design  Problem 

This section discusses the microarchitectural design problem. Section 5.1.1 formally defines
the term microarchitecture and then the phrase "microarchitectural design." Section 5.1.2
introduces a set of issues that are important to microarchitectural design. Section 5.1.3
shows a systematic approach to these microarchitectural issues.

5.1.1.  Microarchitectural  Design-The  Definition 

The term microarchitecture was defined informally in Chapter 1 as the specification of how
the macroarchitecture is implemented in a given technology. More formally, the term
microarchitecture can be defined with respect to Gajski’s tripartite representation [Gaj85] as all the
information the designer knows about the design at its microarchitectural level. As shown in Figure
5-1-1, the microarchitectural level is one of the five possible design levels in Gajski’s tripartite
representation. Since each design level is shown in Figure 5-1-1 to span all three domains, the
term microarchitectural design refers to the step at which the designer specifies all the
microarchitectural level features in the behavioral, structural, and physical domains.

[Figure 5-1-1 labels: System Level. Structural Domain: Processors, Memory, Switches; Hardware
Modules; ALUs, MUXs, registers. Behavioral Domain: Performance Specification; Algorithms,
Instruction set; Register Transfers. Physical Domain.]

Figure 5-1-1 The Tripartite Representation of a Design

In Gajski’s tripartite representation, a design can be described in three separate domains:
behavioral, structural, and physical. Each domain is represented by one of the three axes that
form the Y chart. Within each domain, there are five design levels. I have added concentric
circles to Gajski’s original tripartite representation to show the five design levels graphically. The
design levels can be viewed as different levels of abstraction, and each design level represents all
the information known about the design at some point in the design process.

The microarchitectural design step of the SPUR CPU is shown as an example in Figure
5-1-2. This figure shows the SPUR CPU design process discussed in Section 3.2 (Figure 3-2-1)
with respect to the tripartite representation. This representation idealizes the SPUR CPU design
process into a purely sequential process that starts at the performance specification in the
behavioral domain and spirals toward the final product, the layout in the physical domain. This is
an idealized picture because, in practice, the sub-steps and products within the four major steps
cannot be defined as clearly as shown.

Figure 5-1-2 The Tripartite Representation of the SPUR CPU Design Process

The four major steps in the SPUR design process (Specification, Macroarchitectural Design,
Microarchitectural Design, and Implementation) are shown here with respect to the tripartite
representation. The three products of the microarchitectural design step are the behavioral
description in the behavioral domain, a set of micro-modules (specified in block diagrams) in the
structural domain, and a floor plan in the physical domain. The ALU and Shifter are examples of
micro-modules. On the other hand, the Operand Supplier and Functional Unit are examples of
macro-modules.

5.1.2.  Microarchitectural  Design  Issues 

The  most  general  approach  for  handling  the  microarchitectural  design  problem  involves  two 
steps: 


(1) Design the datapath, and

(2) Design a controller that controls the datapath.

This  procedure,  however,  is  so  general  that  telling  it  to  an  inexperienced  designer  is  about  as 
helpful  as  telling  someone  who  is  afraid  of  flying  that  the  only  danger  in  aviation  is  hitting  the 
ground.  There  are  just  too  many  tasks  in  these  two  steps.  Fortunately,  I  can  give  more  direct 
advice:  All  the  tasks  involve  making  decisions  concerning  certain  issues.  The  designer  can 
approach  these  two  complex  steps  systematically  by  asking  himself  what  the  important  issues  are 
and  finding  solutions  to  them.  Based  on  the  SPUR  CPU  design  experience,  I  think  these  are  the 
six  important  issues  affecting  microarchitectural  design: 

(1) Off-chip Communication

Off-chip communication has always been a bottleneck due to the limited number of pins,
and it is worse for output ports because of power considerations [PaS80]. This problem is
more severe for modern microprocessors, which communicate not only with memory but
also with coprocessors, memory management units, and other
microprocessors in a multiprocessor system.

(2)  Pipeline  and  Clocking 

A  longer  pipeline  usually  leads  to  a  shorter  cycle  time.  The  performance  gain  from  a  shorter 
cycle  time,  however,  may  be  lost  due  to  the  increased  cost  in  branches  and  data  hazards. 
Alternatively,  a  shorter  pipeline  is  easier  to  control  and  the  penalties  are  smaller  for 
branches  and  data  hazards.  But  a  shorter  pipeline  usually  requires  more  clock  phases  per 
cycle,  leading  to  a  more  complicated  clock  distribution  network,  a  more  severe  clock  skew 
problem,  and  ultimately  a  longer  cycle  time. 

(3)  Micro-Modules  Selection 

The  microarchitect  must  select  a  set  of  micro-modules  to  implement  the  instruction  set  and 
other  macroarchitectural  features  specified  by  the  macroarchitect. 

(4)  Resources  Allocation 

On-chip storage trade-offs, such as the size of the register file versus the size of the instruction
cache, are a very interesting problem by themselves [GoH86]. This problem can be generalized to
include the process of allocating resources to the set of micro-modules.

(5)  On-chip  Interaction 

A clean microarchitectural design can restrict most on-chip interactions to internal busses.
Certain functions, such as trap handling, inherently involve many on-chip components, and
interaction is unavoidable. Certain instructions also tend to involve many on-chip
components, and these instructions should be avoided.

(6)  Floor  Planning 

The  microarchitecture  must  eventually  be  implemented  on  the  limited  area  of  a  silicon  chip. 
The  microarchitect  must  decide  how  to  place  the  set  of  macro-modules  according  to  their 
sizes,  aspect  ratios,  and  connections  between  the  macro-modules. 

I call these issues the microarchitectural issues. They can be divided into three groups with
respect to the three domains in Gajski’s tripartite representation.

Behavioral  Issues 

Off-chip  communication,  and  pipeline  and  clocking  are  behavioral  issues. 

Structural  Issues 

Micro-modules selection, on-chip interaction, and resources allocation are structural issues.

Physical Issue

Floor  planning  is  a  physical  issue. 

A  microarchitect  must  answer  some  tough  questions  concerning  these  issues  when  he 
designs  the  datapath  and  the  controller.  His  decisions  on  these  issues  will  have  a  direct  effect  on 
the  performance,  resources,  and  complexity  tradeoffs. 

5.1.3.  A  Systematic  Approach  to  Microarchitectural  Issues 

A systematic approach to microarchitectural design must begin with a systematic approach
to the microarchitectural issues. Ideally, the microarchitect would like to tackle one
microarchitectural issue at a time. Unfortunately, all microarchitectural issues are interrelated.
Decisions concerning one issue usually lead to (or restrict) decisions concerning the other issues.
A systematic approach to these issues must take these interrelations into account. Here is the
systematic approach I recommend:

Before  making  any  important  decisions  concerning  any  microarchitectural  issue  or  issues, 
the  microarchitect  should: 

(1)  List  all  the  unanswered  questions  concerning  each  issue. 

(2) Construct a model that can isolate the issue or issues.

(3)  Use  the  model  to  conduct  quantitative  experiments  to  answer  the  questions. 
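
As an illustration of these three steps, the sketch below isolates the pipeline and clocking
issue in a few lines of Python. It is only a toy model under stated assumptions: cycle time is
the total logic delay divided evenly among the stages plus a fixed latch overhead, and each
branch or data hazard stalls the pipeline for `depth - 1` cycles. The function name and all
parameter values are hypothetical and are not SPUR measurements. The question being asked is
which pipeline depth minimizes the average time per instruction.

```python
def time_per_instruction(depth, logic_delay_ns=200.0, latch_overhead_ns=20.0,
                         stall_fraction=0.3):
    """Average time per instruction (ns) for a pipeline with `depth` stages.

    Assumed toy model:
      cycle time = logic_delay_ns / depth + latch_overhead_ns
      CPI        = 1 + stall_fraction * (depth - 1)
    where `stall_fraction` is the fraction of instructions that cause a
    branch or data-hazard stall of (depth - 1) cycles.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    cycle_time = logic_delay_ns / depth + latch_overhead_ns
    cycles_per_instruction = 1.0 + stall_fraction * (depth - 1)
    return cycle_time * cycles_per_instruction


# The quantitative experiment: sweep the pipeline depth and compare.
depths = range(1, 11)
for depth in depths:
    print(f"depth {depth:2d}: {time_per_instruction(depth):6.1f} ns per instruction")

best_depth = min(depths, key=time_per_instruction)
print("best depth under these assumptions:", best_depth)
```

In this toy model a deeper pipeline shortens the cycle time, but beyond some depth the added
stall cycles outweigh the gain, which is the tradeoff described under Pipeline and Clocking.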

The modeling language does not have to be a hardware description language. The goal is to
isolate certain aspects of the microarchitectural design at a time. The model the microarchitect
constructs and the experiments he conducts must take into account the characteristics of the
underlying technology and implementation considerations. There are a couple of interesting
questions concerning this approach:

•  How  can  the  microarchitecture  be  modeled  such  that  the  microarchitect  can  examine  a  sub¬ 
set  of  the  microarchitectural  issues  at  a  time? 

•  What  are  the  important  parameters  to  be  measured  in  the  experiments  such  that  the  microar¬ 
chitect  can  make  quantitative  decisions  concerning  the  issue? 

Before  I  try  to  answer  these  questions,  I  review  background  studies  on  systematic  approaches  to 
the  general  design  problem. 

5.2.  Background  Studies 

Hardware description languages and silicon compilers are two areas to look for ideas that
can help us in developing a systematic approach to microarchitectural design, because
microarchitecture is just one possible representation of the hardware to be implemented in silicon.
Silicon compiler research can be considered as an extension of hardware description language
research because once the hardware is described in a machine readable form, you probably want
to generate the silicon automatically, which is the goal of silicon compiler research. Hardware description
language research, on the other hand, does not limit its focus to integrated circuits. Hardware
description language research must also study hardware at board and system levels. Furthermore,
hardware description language research must also investigate problems such as formal verification,
compilation facilities, access to program libraries, version control, and standardization. Therefore,
with respect to the scope of their research, silicon compiler research can also be considered as a
subset of hardware description language research.

5.2.1. Hardware Description Languages

Hardware description languages can be defined as computer languages for describing,
documenting, simulating, and synthesizing digital systems with the aid of a computer [Su77].
According to Chu [Chu74], describing digital systems in a computer language is nothing new:

The  use  of  computer  languages  to  describe  digital  system  designs  can  be  traced  back  to 
Shannon’s  work  on  switching  circuit  in  1939,  Aiken’s  work  on  switching  theory  at  Harvard  in 
the  1940’s,  the  logic  diagrams  at  MIT  and  the  National  Bureau  of  Standards  in  the  late  1940’s, 
the  flipflop  equations  in  the  1950’s,  and  the  register  languages  in  the  1960’s. 

Yaohan  Chu,  Why  Do  We  Need  Computer  Hardware  Description  Languages? 

Computer,  December  1974,  Page  18 

Hardware description languages’ application, however, was not widespread until the early 1970s,
when researchers started using them as documentation, simulation, and teaching tools. By the late
1970s [Su77] hardware description languages were well established as documentation and
simulation tools. There was also an attempt to use a hardware description language to describe a new
invention at that time, the microprocessor [Lip75]. Two interesting research topics in hardware
description languages that emerged in the late 1970s are:

Digital  System  Analysis  and  Evaluation 

The goal was to find ways to evaluate the effectiveness and prove the correctness of a
proposed digital system by simply analyzing the system’s description in a hardware description
language. This would eliminate the need for time consuming simulation.

Structured  Design  of  Digital  System 

A digital system described in a well designed hardware description language should be easy
to partition into modules that are easy to build and design. This will encourage designers to
use relatively independent modules.


Figure  5-2-1  Classification  of  Hardware  Description  Languages 

In each domain, there are certain design levels where no formal hardware description language
exists. These levels are usually described informally, and the informal ways to describe them are
shown in parentheses. The behavioral domain is the domain most covered by hardware
description languages. As a matter of fact, most so-called hardware description languages are
languages that describe the microarchitectural level of the behavioral domain.

These two topics remain important driving forces in hardware description language research
in the 1980s. Furthermore, the trend of the 1980s is more formal applications of hardware
description languages and synthesis of hardware from formal machine descriptions. A
standardized universal hardware description language has been a goal since the early 1970s but was never
realized. I think the goal of having a standardized universal hardware description language is hard to
achieve because there are different requirements for different applications. This is illustrated in
Figure 5-2-1, where I have borrowed Gajski’s ideas (Figure 5-1-1) and classified hardware
description languages into five levels and three domains.

In the behavioral domain, the system level behavior is usually specified informally in textual
form as a performance specification. The macroarchitectural level behavior can be modeled by an
instruction level simulator written in a high level programming language such as C. The
microarchitectural level behavior can be described by register transfer languages such as ISP’.
Logic level behavior can be described by Boolean equations. Finally, the SPICE input deck can
be used to describe the circuit level behavior.
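
To make the difference between levels concrete, the sketch below describes the same one-bit full
adder twice: once by its arithmetic behavior and once by the Boolean equations one would write at
the logic level, and then checks that the two descriptions agree. Python is used here only as a
stand-in notation; it is not one of the languages named above, and the function names are
illustrative.

```python
from itertools import product


def full_adder_algorithmic(a, b, carry_in):
    """Higher-level behavior: add three bits arithmetically."""
    total = a + b + carry_in
    return total & 1, total >> 1          # (sum, carry_out)


def full_adder_boolean(a, b, carry_in):
    """Logic-level behavior: the same function as Boolean equations."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return sum_bit, carry_out


# Exhaustively compare the two descriptions on all 8 input combinations.
mismatches = [bits for bits in product((0, 1), repeat=3)
              if full_adder_algorithmic(*bits) != full_adder_boolean(*bits)]
print("descriptions agree" if not mismatches else f"mismatch on {mismatches}")
```

The two functions compute the same result, but only the second exposes the gates-and-wires
detail that a logic level description is meant to capture.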

In the physical domain, circuit and logic levels are probably the only levels that need to be
described in machine readable form. CIF is the most common language for describing physical
characteristics at the circuit level, that is, the layout (see Figure 5-1-1). On the other hand, procedural design
languages such as ICL and DPL are common languages for describing physical characteristics at
the logic level. All other levels are usually described informally in floor plans that have different
levels of detail.

In  the  structural  domain,  PMS  at  the  system  level  is  the  only  well  known  hardware  descrip¬ 
tion  language.  All  other  levels  are  described  informally  by  diagrams  and  netlists.  Here  are  three 
reasons  for  this  lack  of  hardware  description  languages  in  the  structural  domain: 

• Most hardware description languages are modeled after high level programming languages.
They are good at capturing a design’s behavior, but contain little structural information.

• Diagrams consisting of black boxes and connections are natural ways to describe structure.
These diagrams may have different levels of detail at different design levels but are similar
structurally. Therefore it is conceivable to use PMS for all levels in the structural domain.

• Most current custom design methodologies require the designer to work in the structural
domain. The graphical CAD tools they use then capture the structural information
implicitly and eliminate the need for explicit description of the structure using formal description
languages. For example, the Magic .ext files [SMH85] can be considered as an implicit
hardware description language that describes the netlist.

Figure 5-2-2 Pure Top Down Design Methodology

The steps of the pure top down design methodology form an inward spiral. For simplicity,
feedback paths between each step are not shown. However, these feedback paths are the reasons why
iteration is necessary. The pure bottom up design methodology is exactly opposite: an outward
spiral. The reader can get a mental picture of the pure bottom up design methodology by
reversing the arrow heads.

Human designers prefer to work in the behavioral domain, but their work in the behavioral
domain must eventually be transformed to the physical domain. Since work in the physical
domain is tedious (but necessary), designers prefer this transformation to be done automatically.
This transformation, however, will not be efficient unless important structural information is
provided, because the structural domain is the bridge between the behavioral and the physical
domains. Unfortunately, the lack of hardware description languages in the structural domain
makes it hard to express structural information in machine readable form. Consequently,
designers end up doing more work in the structural and physical domains than they prefer. In
order to reduce the manual labor in the structural and physical domains, researchers must pay
more attention to structural information representation.
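
As a small illustration of what machine readable structural information might look like, the
sketch below stores a netlist (boxes and the connections between them) as a Python dictionary
and derives which modules are connected to each other, the kind of information a floor planner
needs. The module names echo the micro-modules mentioned earlier, but the nets, port names, and
the `neighbors` function are hypothetical and do not describe the actual SPUR datapath or the
Magic .ext format.

```python
from collections import defaultdict

# Hypothetical netlist: each net name maps to the (module, port) pins it connects.
NETLIST = {
    "busA":   [("RegisterFile", "readA"), ("ALU", "inA"), ("Shifter", "in")],
    "busB":   [("RegisterFile", "readB"), ("ALU", "inB")],
    "result": [("ALU", "out"), ("Shifter", "out"), ("RegisterFile", "write")],
}


def neighbors(netlist):
    """Return {module: set of modules that share at least one net with it}."""
    adjacency = defaultdict(set)
    for pins in netlist.values():
        modules = {module for module, _port in pins}
        for module in modules:
            adjacency[module] |= modules - {module}
    return dict(adjacency)


for module, others in sorted(neighbors(NETLIST).items()):
    print(module, "->", sorted(others))
```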

5.2.2.  Silicon  Compilers 

The term silicon compiler was first used by Dave Johannsen of Caltech back in 1981
[Joh81]. The silicon compiler concept was inspired by the pure top-down design methodology. In
Figure 5-2-2 above, I have drawn my view of the pure top-down design methodology with respect
to Gajski’s tripartite representation (Figure 5-1-1). In this view, the steps of the pure top-down
design methodology form an inward spiral that starts at the system level of the behavioral domain
(performance specification) and ends at the circuit level of the physical domain (layout). The
ultimate goal of a silicon compiler is to carry out this inward spiral automatically.

Early silicon compilers were proposed to carry out the entire synthesis process. In order to
simplify this complex task, a target technology was usually assumed and a fixed floor plan was
chosen by human designers. In fact, as illustrated in Figure 5-2-3, the highest level input the
early primitive silicon compilers could accept was the logic level description in the behavioral
domain. Notice that the human designer must carry out all the design steps manually up to the
logic level. According to Newton and Sangiovanni-Vincentelli [NeS86], the most important
contribution of the early silicon compiler research is the development of the Procedural Design
Language, a hardware description language in the physical domain (see Figure 5-2-1).

The current goal of most silicon compiler research is no longer to carry out the entire
synthesis process by one single program. The current emphasis is to create a "silicon compiler design
environment." This environment is illustrated in Figure 5-2-4, in which the synthesis process is
divided into stages. CAD tools are then developed to optimize resource allocation and
performance at each stage, and to automate the transformation from one stage to the other. The
common input of a modern silicon compiler design environment is the register transfer description.
The Yorktown Silicon Compiler at IBM [Cam87] is one example. Other more ambitious projects,
such as the Design Automation Assistant (DAA) at AT&T Bell Laboratories [Kow85], accept
input at the macroarchitectural level of the behavioral domain, that is, an algorithmic description.

Figure 5-2-4 Modern Silicon Compiler Design Environment

The key words here are "design environment." In the silicon compiler design environment, a set
of CAD tools is available to optimize and automate each step of the synthesis process. Most
modern silicon compiler design environments can accept inputs at the microarchitectural level of
the behavioral domain and generate layout automatically. For some applications, such as digital
signal processing, and in some ambitious silicon compiler projects, they can even accept inputs at
the macroarchitectural level of the behavioral domain.


The translation from the algorithmic description to the register transfer description (Steps 4,
5, and 6 in Figure 5-2-4) is not a trivial task except for some very specific applications such as
digital signal processing. For more complex applications such as CPU design, this translation
usually requires the use of knowledge-based expert system programming techniques. Newton
and Sangiovanni-Vincentelli [NeS86] said that procedural design systems and knowledge-based
expert systems will be crucial for the development of future synthesis systems.
They also believe that the major components of a synthesis system are:

(1) Procedural design and module generation: Step 7, Step 8, and Step 9 in Figure 5-2-4.

(2) Logic synthesis: Step 10, Step 11, and Step 12 in Figure 5-2-4.

(3) Physical synthesis: Step 13 and Step 14 in Figure 5-2-4.

5.2.3.  Meet  in  the  Middle  Approach 

There is one major philosophical difference between the goal of this chapter and research in
hardware description languages and silicon compilers, which is based strongly on computer
science theory. The goal of that research is to introduce a new theory-based approach to the
design process and ultimately automate it. This chapter, on the other hand, is based on the design
process that was used to create the SPUR CPU. I will look at ways to make this process more
systematic and efficient based on the lessons I learned. Furthermore, the goal of most hardware
description language and silicon compiler research is to develop CAD tools that can automate the
top-down design methodology. However, as many custom VLSI chip designers have learned, the
top-down design methodology is not as practical as the "meet-in-the-middle" approach. My view
of the "meet-in-the-middle" approach is shown in Figure 5-2-5.

Figure 5-2-5 Meet In The Middle Approach

In the meet-in-the-middle approach, system designers start at the performance specification in the
behavioral domain and work their way down. At the same time, logic and circuit designers start
at the layout in the physical domain and work their way up. They meet at the microarchitecture
level of the physical domain.

Cathedral [DRS86] is a silicon compiler for digital signal processing chips that is based on
the meet-in-the-middle approach. The user of Cathedral must provide "structural hints" to the
compiler to aid the behavioral-to-physical compilation. Furthermore, instead of using a module
generator to generate the layout, Cathedral’s compilation is based on silicon modules that
are designed by layout designers and are composed of functional building blocks. For example, if
the SPUR CPU were to be compiled by a Cathedral-type compiler, the Operand Supplier and the
Functional Unit (see Figure 2-3-1) would be two of the silicon modules. The set of silicon modules
available to Cathedral was carefully restricted. According to the authors, this restriction was the
key to Cathedral’s success. The authors also believed that in order to develop a Cathedral-type
compiler, one must follow a series of five steps:

(1)  Define  a  wide,  but  concise  class  of  system  design  applications. 

(2)  Define  a  target  architecture  and  its  associated  layout  style. 

(3)  Define  a  design  strategy. 

(4)  Define  the  behavioral  language  that  models  the  microarchitecture  and  silicon  modules. 

(5)  Then  and  only  then  develop  the  CAD  tools. 

In  our  SPUR  CPU  example,  the  result  of  Step  1  is  the  specification  of  a  general  purpose 
CPU  with  LISP  support.  In  Step  2,  the  target  architecture  is  a  RISC-style  processor  that  does  not 
use  microcode,  and  the  layout  style  is  the  Mead  &  Conway  style.  Step  3,  Step  4,  and  Step  5  are 
the major steps toward a systematic approach to the microarchitectural design problem. They
are discussed in Section 5.3 in more detail.

5.3. Steps Toward a Systematic Approach to Microarchitectural Design

The  major  steps  toward  a  systematic  approach  to  microarchitectural  design  were  discussed 
briefly at the end of Section 5.2. I have added some of my own ideas and restated them as:

(1)  Propose  a  general  design  strategy. 

(2) Develop models that can capture the microarchitecture’s behavioral, structural, and
physical features.

(3) Build or propose CAD tools that can aid the previous two steps.

(4)  Refine  the  general  design  strategy  proposed  in  Step  1  and  iterate  again. 

The first three steps are based on the discussion at the end of Section 5.2. I added the last
step to introduce feedback into the approach. These steps are discussed in Section 5.3.1, Section
5.3.2, Section 5.3.3, and Section 5.3.4, respectively. In order to limit the scope of my research, I
will focus the discussion on RISC-style processors that do not use microcode. The discussion is
based on the following observations:

(1) For a RISC-style processor, just looking at the instruction set can tell you a great deal
about the microarchitecture.

(2)  The  model  or  models  for  the  microarchitecture  must  be  abstract  enough  for  making  high 
level  design  decisions  and  detailed  enough  for  logic  specification  and  simulation. 

(3) We must reduce the time we spend on verification in order to improve the efficiency of
the design process.

(4)  We  must  document  all  the  important  design  decisions  and  the  assumptions  or  facts  on 
which  these  decisions  are  based. 

5.3.1. The Design Strategy

In the most general terms, the microarchitectural design problem can be divided into two
tasks: (1) design the datapath, and (2) design a controller that controls the datapath. This is shown
graphically in Figure 5-3-1. The design of the datapath is straightforward. First, the
microarchitect must select a set of macro-modules needed to implement the instruction set. Examples
of macro-modules in the SPUR CPU are the Operand Supplier and the Functional Unit. After the
microarchitect is satisfied with the behavior of these macro-modules, he can expand the
macro-modules into micro-modules. Examples of micro-modules in the SPUR CPU are the ALU and
the Shifter.

Figure 5-3-1 Design Strategy

The microarchitectural design problem can be divided into two tasks: datapath design and
controller design. The datapath can be further divided into macro-modules and then micro-modules.
The controller can be further divided into two parts: one controls instruction execution and
another controls unusual conditions. The handling of unusual conditions, however, can be integrated
into the part that controls instruction execution via the use of internal instructions (see Section
2.3.3 and Section 2.4.3).

The controller of a RISC machine can be divided into two relatively independent
components (see Section 2.4.3): the major component that controls instruction execution and a
supporting component that controls unusual internal and external conditions. In this design strategy,
unusual conditions are considered secondary effects, and they are examined in terms of how
they affect the primary event, instruction execution. Furthermore, as discussed in Section
2.4.3, unusual condition handling as well as many other control sequences can be reduced to
sequences of internal instructions that are similar to the regular instructions in the instruction set.
Therefore, the microarchitect can and should concentrate initially only on the primary
event, instruction execution, and temporarily ignore the unusual conditions and other complex
control sequences.

The microarchitect can build the abstract model of the microarchitecture (see Figure 5-3-1)
according to the instruction set by concentrating only on instruction execution. All he has to do is
to select a set of macro-modules that are needed by the instruction set and then design the
Instruction Execution Controller that controls the macro-modules. After the microarchitect is satisfied
with the behavior of this abstract model, he can then turn the abstract model into the expanded
model (see Figure 5-3-1) by expanding macro-modules into micro-modules and by taking
unusual condition detection into account. The abstract and expanded models are discussed in
Section 5.3.2. Notice that I have changed the definitions of macro- and micro-modules slightly in
this section. Instead of using them to describe components in both the datapath and the controller,
I have used them exclusively for components in the datapath. These definitions will be followed
for the rest of the discussion.

Figure 5-3-2 The Abstract Model

The microarchitect constructs this model to study possible instruction execution schemes. The
microarchitect begins the construction by first selecting a set of high level macro-modules
according to the execution scheme he has in mind. He then designs a simple controller to translate the
instruction set into a set of high level control signals to control the high level macro-modules.
Examples of macro-modules in the SPUR CPU are the Cache Controller Interface and the Operand
Supplier.
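
The datapath half of this strategy, selecting macro-modules first and expanding them into micro-modules later, can be sketched in a few lines of Python. This is only an illustration and not a tool from the SPUR project: the class names and the `expand` method are hypothetical, and only the module names (Operand Supplier, Functional Unit, ALU, Shifter) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class MicroModule:
    name: str


@dataclass
class MacroModule:
    name: str
    # Empty while the design is still at the abstract-model level.
    micro_modules: list = field(default_factory=list)

    def expand(self, micro_names):
        """Return an expanded copy of this macro-module (hypothetical helper)."""
        return MacroModule(self.name, [MicroModule(n) for n in micro_names])


# Abstract model: macro-modules only.
abstract_datapath = [MacroModule("Operand Supplier"), MacroModule("Functional Unit")]

# Expansion plan: the text names the ALU and the Shifter as parts of the Functional Unit.
expansion_plan = {"Functional Unit": ["ALU", "Shifter"]}

# Expanded model: each macro-module is replaced by its micro-modules where a plan exists.
expanded_datapath = [
    m.expand(expansion_plan.get(m.name, [])) for m in abstract_datapath
]

for module in expanded_datapath:
    print(module.name, [mm.name for mm in module.micro_modules])
```

Keeping the abstract and expanded datapaths as separate objects mirrors the two models: the abstract list stays available for high level studies after the expansion has been made.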

5.3.2. Different Models for Different Issues

The  instruction  set,  the  external  interface,  and  performance  requirements  are  usually  fixed  at 
the  microarchitectural  level.  Therefore,  the  first  task  of  the  microarchitect  is  to  develop  an 
instruction  execution  scheme  that  can  fulfill  the  external  interface  and  performance  requirements. 
The type of model the microarchitect needs at this stage is the abstract model shown in Figure
5-3-2. He will use this model to verify that the external interface and performance requirements are
indeed met. Furthermore, he will also use this model to answer questions concerning two
microarchitectural issues: (1) off-chip communication, and (2) pipeline and clocking.


Figure 5-3-3 The Expanded Model

In the expanded model, the macro-modules in the abstract model (Figure 5-3-2) are expanded into
low level micro-modules. The controller in the abstract model is expanded into master control, local
decoding logic blocks, and the unusual condition detection logic. The ALU and the SHIFTER are
examples of micro-modules in the SPUR CPU.

The microarchitect must keep the abstract model as simple as possible so that the effects of
his design decisions can be identified more directly. This can be accomplished by ignoring details
that have little effect on the microarchitectural issues that are being investigated. Since pipeline
and clocking and off-chip communication are the two microarchitectural issues to be investigated
with the abstract model, second order effects such as unusual condition handling can be ignored.
In order to simplify the investigation further, the microarchitect may also want to group
instructions into types and examine how the abstract model will execute each type of instruction instead
of examining individual instructions. In this simple abstract model, where second order effects
are being ignored, the set of high level control signals is a good indication of the controller
complexity. Similarly, the set of macro-modules is a good indication of datapath complexity.
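
As a rough illustration of these two complexity indicators, the sketch below groups instructions into types and counts the distinct high level control signals and the macro-modules. The instruction types and signal names are invented for the example and are not the SPUR control signals; the macro-module names are taken from the text.

```python
# Hypothetical instruction types and the high level control signals each one needs.
CONTROL_SIGNALS_BY_TYPE = {
    "alu":    {"read_operands", "execute", "write_result"},
    "load":   {"read_operands", "execute", "request_data", "write_result"},
    "store":  {"read_operands", "execute", "send_data"},
    "branch": {"read_operands", "execute", "redirect_fetch"},
}

# Macro-modules named in the text as SPUR examples.
MACRO_MODULES = ["Operand Supplier", "Functional Unit", "Cache Controller Interface"]


def controller_complexity(signals_by_type):
    """Number of distinct high level control signals the controller must generate."""
    return len(set().union(*signals_by_type.values()))


def datapath_complexity(macro_modules):
    """Number of macro-modules in the abstract datapath."""
    return len(macro_modules)


print("control signals:", controller_complexity(CONTROL_SIGNALS_BY_TYPE))
print("macro-modules:", datapath_complexity(MACRO_MODULES))
```

Comparing these two counts across alternative execution schemes gives a first, coarse ranking before any detailed model exists.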

Once the microarchitect has verified that the external interface and performance requirements
have been met, he must move on to microarchitectural issues such as on-chip interaction,
micro-module selection, and resource allocation. Since these microarchitectural issues require a more
detailed model, he must expand the abstract model into the expanded model shown in Figure
5-3-3. In this model, the macro-modules are expanded into micro-modules and logic is added to
detect all the unusual conditions. Furthermore, in order to get a better understanding of the
micro-modules, the microarchitect must examine how the expanded model executes each
instruction instead of how it executes each type of instruction. Finally, if the instruction is not provided
directly by the external world, an instruction supplier must be added to this model. This is not
shown in Figure 5-3-3 in order to keep the figure simple.

The microarchitect should be able to learn enough about the microarchitecture from the
abstract and expanded models that he can draw out a detailed floor plan. The floor plan can be
considered a physical model. Table 5-3-1 summarizes the various models I proposed and the
microarchitectural issues each model investigates. The abstract model (Figure 5-3-2) is used
mainly to investigate behavioral issues. However, structural information is implied in the
abstract model by the macro-module connection scheme. Similarly, although the expanded
model (Figure 5-3-3) is used mainly to investigate structural issues, the microarchitect is
providing behavioral information when he describes the behavior of the micro-modules.

Table 5-3-1 The Microarchitectural Models and Issues

"MAJOR" means that issue is a major concern of that model.

"minor" means that issue is a minor concern of that model,
means that issue is not a concern of that model.

Either the abstract or the expanded model can be used to estimate the average number of
cycles per instruction (C) for the performance model (see Section 4.1). This can be accomplished
by counting the number of cycles either model takes to execute a set of instructions with the
proper mix of instructions. Both models can also be used to estimate the cycle time (T) for the
performance model. This can be accomplished in two different ways:

(1)  A  simulator  for  the  model  should  be  able  to  perform  simple  timing  analysis  if  explicit 
timing  information  is  provided  for  each  module. 

(2) A simulator should be able to trace all the sequential events within each clock cycle, and the
microarchitect can estimate the cycle time based on these lists of events.

In the first approach, there is always the danger of looking at the wrong critical path within a
module and subsequently assigning the wrong timing information to the module. In the second
approach, the trace information must be interpreted, and this can be cumbersome. The best chance
of success is to use Approach 1, with Approach 2 acting as a check.

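The two estimates can be illustrated with a small Python sketch. It assumes that C is the mix-weighted average of per-type cycle counts and that, in Approach 1, the cycle time is bounded by the slowest chain of modules that must operate sequentially within one cycle. The instruction mix, cycle counts, delays, and paths below are made-up numbers, not SPUR measurements.

```python
def average_cycles_per_instruction(mix, cycles_per_type):
    """Mix-weighted average number of cycles per instruction (C)."""
    total = sum(mix.values())
    return sum(count * cycles_per_type[t] for t, count in mix.items()) / total


def estimate_cycle_time(paths, module_delay_ns):
    """Approach 1 (sketch): the longest sum of module delays over all
    chains of modules that operate one after another within a cycle."""
    return max(sum(module_delay_ns[m] for m in path) for path in paths)


# Hypothetical instruction mix (counts) and cycles needed by each instruction type.
mix = {"alu": 50, "load": 20, "store": 10, "branch": 20}
cycles_per_type = {"alu": 1, "load": 2, "store": 2, "branch": 2}

# Hypothetical per-module delays and sequential paths within one clock cycle.
module_delay_ns = {
    "Operand Supplier": 30,
    "Functional Unit": 50,
    "Cache Controller Interface": 40,
}
paths = [["Operand Supplier", "Functional Unit"], ["Cache Controller Interface"]]

C = average_cycles_per_instruction(mix, cycles_per_type)
T = estimate_cycle_time(paths, module_delay_ns)
print(f"C = {C} cycles/instruction, T = {T} ns")
```

A wrong entry in `module_delay_ns` silently produces a wrong T, which is exactly the danger noted for Approach 1 and the reason a trace-based check is still useful.
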
5.3.3. CAD Tools Considerations

The CAD tools needed to make the design process more systematic will be discussed in
more detail in Section 5.4, Section 5.5, and Section 5.6, when the SPUR CPU is used as an
example to illustrate different stages of the systematic approach. In this section, I make some
general observations concerning CAD tools.

5.3.3.1. Unifying Different Levels of Verification

In the SPUR CPU design process (Figure 3-2-1), the macroarchitecture, microarchitecture,
and layout were verified sequentially and independently by instruction level, behavioral level,
and switch level simulations, respectively. The results of these simulations had to be studied
independently by the macroarchitect, the microarchitect, and the logic designer. One important
observation, stated in Section 3.2.4, was that this independent verification strategy required a lot
of human interaction time. In order to reduce verification time, the redundancy between different
levels of verification must be reduced. In the current SPUR CPU design process, behavioral
simulation results were verified by comparing them with the instruction level simulation results.
Similarly, switch level simulation results were verified by comparing them with the behavioral level
simulation results. This approach, however, still requires a lot of human interaction and format
conversions (see Figure 3-2-4). Mixed-level simulation is a better approach.

Ideally, we would like to have a mixed-level simulator that will accept modules at different
levels of abstraction during various stages of the design process. Initially, only high level
macro-modules should be used so that high level design decisions can be made. These high level
decisions create a rough specification of each macro-module and enable each of them to be replaced
by a set of low level micro-modules. In order to verify that this set of micro-modules can indeed
replace the high level macro-module, they must be simulated together with the other high level
macro-modules.

The macro/micro module substitution, however, should be a two-way street. During
mixed-level simulation, the designer should be able to extract important parameters from the set
of low level micro-modules being simulated. These parameters can then be used to update and
modify the corresponding high level macro-module. After the important parameters are extracted,
the set of low level micro-modules can then be replaced by the updated, more accurate, high level
macro-module to reduce simulation time when we simulate other low level micro-modules.
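
The substitution check at the heart of mixed-level simulation can be sketched as follows. A behavioral macro-module for the Functional Unit is compared, on the same test vectors, with an expanded version built from a bit-level ALU and a step-by-step shifter. Everything here is an assumption made for illustration: the 32-bit width, the two operations, the function names, and the test vectors are not taken from the SPUR design.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data width, for illustration only


def functional_unit_macro(op, a, b):
    """High level behavioral description of the macro-module."""
    if op == "add":
        return (a + b) & MASK
    if op == "sll":
        return (a << (b & 31)) & MASK
    raise ValueError(f"unknown operation: {op}")


def alu_micro(a, b):
    """Low level micro-module: bit-by-bit ripple-carry adder."""
    result, carry = 0, 0
    for i in range(32):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def shifter_micro(a, amount):
    """Low level micro-module: shift left one position at a time."""
    for _ in range(amount & 31):
        a = (a << 1) & MASK
    return a


def functional_unit_expanded(op, a, b):
    """The macro-module rebuilt from its micro-modules."""
    if op == "add":
        return alu_micro(a, b)
    if op == "sll":
        return shifter_micro(a, b)
    raise ValueError(f"unknown operation: {op}")


def mismatches(macro, expanded, vectors):
    """Return the test vectors on which the two levels of description disagree."""
    return [v for v in vectors if macro(*v) != expanded(*v)]


VECTORS = [("add", 1, 2), ("add", 0xFFFFFFFF, 1), ("sll", 1, 31), ("sll", 3, 33)]
print(mismatches(functional_unit_macro, functional_unit_expanded, VECTORS))
```

Because both versions share one calling interface, either can be plugged into the rest of a simulation, which is what allows the fast macro-module to stand in again once the micro-modules have been validated.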

5.3.3.2. Timing Verification

The SPUR CPU timing was verified at the circuit and switch levels. At the circuit level, the
circuit designer verified the timing of critical circuits with SPICE even before starting the layout.
After the layout of these circuits was completed, the circuit designer measured the parasitic
capacitance and resistance so that the SPICE models could be updated for a more accurate timing
analysis. At the switch level, the timing of the entire CPU was verified by Crystal [SMH85], which
extracted the critical paths from the switch level description. The exact delays of these critical
paths were then verified again with SPICE at the circuit level.

One drawback of the SPUR CPU approach is that timing verification is done at the low
level only, and working at the low level is tedious. Working at the high level is possible here
because the timing requirements at the low level are direct results of high level decisions. In order
to take advantage of this possibility, we need a mixed-level timing verifier. A mixed-level timing
verifier will enable the designer to use the high level work to drive the low level timing
verification. For example, if at the high level the designer decides that the Functional Unit must
have a critical delay less than M during phase N, then any micro-modules that are part of the
Functional Unit (for example, the ALU and the SHIFTER) must all fulfill this same requirement.
The "Abstract Timing Verifier" by Dave Wallace [WaS86] is an example of a mixed-level timing verifier.
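
The Functional Unit example can be sketched as a small check in which a high level timing budget is inherited by every micro-module of the macro-module. This is a hypothetical illustration and does not describe how the Abstract Timing Verifier works; the class, the function, and all delay values are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class TimingBudget:
    macro_module: str
    phase: int           # clock phase N to which the budget applies
    max_delay_ns: float  # critical delay must be less than M


def violations(budget, micro_modules, measured_delay_ns):
    """Return the micro-modules whose delay in the budget's phase is not less than M."""
    return [
        name
        for name in micro_modules
        if measured_delay_ns[(name, budget.phase)] >= budget.max_delay_ns
    ]


# High level decision (hypothetical numbers): Functional Unit, phase 2, less than 40 ns.
budget = TimingBudget("Functional Unit", phase=2, max_delay_ns=40.0)

# Low level results (hypothetical), keyed by (micro-module, phase).
measured_delay_ns = {("ALU", 2): 35.0, ("SHIFTER", 2): 42.0}

print(violations(budget, ["ALU", "SHIFTER"], measured_delay_ns))
```

With these numbers only the SHIFTER is reported, so low level effort can be directed at the one module that breaks the high level decision.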

Another drawback of the SPUR CPU approach is that timing verification is done
completely independently of logic simulation. Like most timing analyzers, Crystal neither knows nor
cares about logic. Consequently, to prevent the timing analyzer from chasing false
critical paths, the user must place "flow control" attributes on certain transistors [SMH85]. This
takes a lot of time and can also be unreliable. The switch level simulator should be able to help
the designer place these "flow control" attributes because the switch level simulator knows
the direction in which each signal propagates during switch level simulation.

One point worth noticing is that timing analyzers such as Crystal were designed at a time
when computer time was relatively expensive. They were specialized tools designed intentionally
to ignore the logic aspect of the circuit so that they could run rapidly in a relatively slow computing
environment. The price to pay was human preparation time. In the current computing environment,
computer time is relatively cheap. It is more desirable to have tools that require less human
preparation time, even though they may consume much more computer time.

5.3.3.3. Documenting the Design Decisions

Most practical VLSI projects are so complex that nobody can specify them accurately until
some work has been done on them. Informality is a powerful strategy for dealing with complexity
because it allows the designer to describe the big picture without worrying about the details.
Therefore, an imprecise but brief specification, for lack of a better word, is good at the
beginning of a VLSI project. Instead of demanding a complete and precise specification from the start, a
good design process or system should help the user to specify and refine the specifications
continuously. This is accomplished by continuously requiring the designer to answer the following
types of questions:

•  Given  a  number  of  interacting  design  objectives  and  goals,  how  should  I  prioritize  them? 

•  Given  a  number  of  alternative  choices,  which  alternatives  should  I  pick? 

Unfortunately, due to the imprecise specification, making these decisions is not always easy.
Consequently, a design decision is frequently nothing but an educated guess, and the design
process is an evolutionary process. At each stage, the design is a proposal whose correctness and
effectiveness must be proved. The proof can be performed either by formal mathematical techniques
or by experiments. Mathematical techniques, however, are for mathematicians; engineers should
always answer their doubts by experiments!

Chapter 5: A Systematic Approach

Since the design at any stage is only a proposal that must be evaluated by experiments,
the design decisions that led to that design must be documented systematically. This
observation is the basis for the development of the theory of plausible design [HAD88]. Since the
underlying philosophy of this thesis is to keep things simple and practical, I will not go into the
details of the theory of plausible design. But I do want to point out the two things the theory of
plausible design tries to address:

(1)  The  sequences  of  design  decisions  and  the  cause-effect  relationships  between  the  design 
decisions. 

(2)  The  assumptions  or  the  evidence  used  by  the  designer  to  justify  his  design  decisions. 

In other words, we must find a systematic way to document not just the decisions but also the
assumptions and evidence behind all the decisions. However, in order to keep the procedure
simple, the designer should document only the important decisions and the decisions that are based
on questionable assumptions.
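
One possible shape for such documentation is sketched below in Python. It records each decision together with its assumptions, its evidence, and the earlier decisions it depends on, so that the consequences of a doubtful assumption can be traced. This is not the notation of the theory of plausible design; the record layout, the `affected_by` helper, and the three example entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DesignDecision:
    ident: str
    decision: str
    assumptions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)  # identifiers of earlier decisions


def affected_by(log, ident):
    """Identifiers of all decisions that depend, directly or indirectly, on `ident`."""
    affected, frontier = [], [ident]
    while frontier:
        current = frontier.pop()
        for entry in log:
            if current in entry.depends_on and entry.ident not in affected:
                affected.append(entry.ident)
                frontier.append(entry.ident)
    return affected


# Illustrative entries only; the dependencies shown are invented for the example.
log = [
    DesignDecision("D1", "Use a four-stage pipeline",
                   assumptions=["one instruction can be fetched every cycle"]),
    DesignDecision("D2", "Put the instruction cache on chip",
                   evidence=["abstract model simulation"], depends_on=["D1"]),
    DesignDecision("D3", "Omit the Instruction Supplier port",
                   depends_on=["D2"]),
]

print(affected_by(log, "D1"))
```

If the assumption behind D1 later proves wrong, the trace lists the decisions that must be re-examined, which is the cause-effect relationship the documentation is meant to preserve.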

5.3.4. Stages of the Systematic Approach

Based  on  the  above  discussions,  I  believe  a  systematic  approach  to  microarchitectural 
design  should  have  three  stages:  the  abstract  stage,  the  expansion  stage,  and  the  floor  planning 
stage. 

The  Abstract  Stage 

Construct  the  abstract  model  (Figure  5-3-2)  and  then  use  it  to  conduct  primary  studies  on 
microarchitectural  issues:  (1)  off-chip  communication,  and  (2)  pipeline  and  clocking. 

The  Expansion  Stage 

Construct the expanded model (Figure 5-3-3) by expanding the macro-modules and the
controller in the abstract model. The microarchitectural issues to be studied here are: (1)
micro-module selection, (2) resource allocation, and (3) on-chip interaction.

The  Floor  Planning  Stage 

Based on the information learned from the abstract and expanded models, construct a
detailed floor plan for the microarchitecture.

The abstract stage and the expansion stage should be repeated for alternative
microarchitectures until satisfactory solutions are found for all microarchitectural issues. Mixed-level
simulation that uses a mixture of macro-modules and micro-modules can be used. During simulation, all
important and questionable design decisions must be documented systematically. Instead of using
toy examples to illustrate this procedure, I will use the SPUR CPU to illustrate the details of this
procedure in Section 5.4, Section 5.5, and Section 5.6.

5.4. The Abstract Stage of Microarchitectural Design

In  the  beginning,  the  macroarchitect  created  the  instruction  set  and  the  interface 
specifications.  The  microarchitect  must  then  find  a  pipeline  and  clocking  scheme  that  can  execute 
the  instruction  set  and  derive  an  off-chip  communication  strategy  that  can  satisfy  the  interface 
specifications.  As  I  will  explain  later,  the  pipeline  and  clocking  and  off-chip  communication 
issues  are  closely  related. 

5.4.1.  Off-Chip  Communication 

The general off-chip communication problem for a microprocessor is shown in Figure 5-4-1
to be a three-port problem. Most modern microprocessors include the Instruction Supplier on
chip. The microprocessor then only has to communicate off chip with the Data Supplier and the
Coprocessor(s). No matter what the situation, the microarchitect must design the Data Supplier
port, the Coprocessor port, and (if necessary) the Instruction Supplier port such that the
performance goal is met and the resource and complexity requirements are still within the constraints.




Figure  5-4-1  Off-Chip  Communication 

The microprocessor off-chip communication is a three-port problem. The instruction execution
engine must communicate with the data supplier, the instruction supplier, and the coprocessor(s).
Many modern microprocessors have an internal instruction cache, which eliminates the
instruction supplier interface. Some processors even include complex functions such as floating point
operations on chip to eliminate the coprocessor interface. This option may run into trouble in the
future when more complex functions are desired and the only way to provide them is via
coprocessors.


The performance specified by the macroarchitect is usually in terms of clock cycles. The
microarchitect must decide when to drive or receive the interface signals within a given cycle.
The two limiting resources are the number of pins available and the power to drive the output
pins. As far as the number of pins is concerned, the sum of the input ($N_{in}$), output ($N_{out}$),
bidirectional ($N_{bi}$), and power pins ($2 \times N_{vdd}$) must be smaller than or equal to the total
number of pins available ($N_{available}$):

$$N_{in} + N_{out} + N_{bi} + 2 \times N_{vdd} \le N_{available} \qquad (5.4.1)$$

The number of power pins ($2 \times N_{vdd}$) is twice the number of $V_{dd}$ pins ($N_{vdd}$) because a GND
pin is needed for each $V_{dd}$ pin. In the old days, one pair of $V_{dd}$ and GND pins was sufficient. But
today, due to high switching frequency and pin inductance, the number of power pins needed is a
function of the switching frequency ($F_{switch}$), the number of output pins ($N_{out}$), and the number
of bidirectional pins ($N_{bi}$):

$$N_{vdd} = \mathrm{func}(F_{switch}, N_{out}, N_{bi}) \qquad (5.4.2)$$

John Keller [Kel85] suggested that for a given switching frequency, a simple solution is to assign
a pair of power pins (one $V_{dd}$ and one GND pin) for each group of $M$ output or bidirectional pins.
This implies the number of $V_{dd}$ pins can be written as:

$$N_{vdd} = \frac{N_{out} + N_{bi}}{M} \tag{5.4.3}$$

If one just looks at Equation 5.4.3, one may think that it is possible to reduce the number of power
pins ($2 \times N_{vdd}$) by time multiplexing. This is not the case. Although time multiplexing will reduce
$N_{out}$ or $N_{bi}$ or both, it also increases the switching frequency ($F_{switch}$). According to Equation
5.4.2, this increase in switching frequency will negate the effects of the reduction in $N_{out}$ or $N_{bi}$. In
order to rewrite Equation 5.4.3 to take this consideration into account, I define the term logical
output $L_{out}$:

$L_{out}$ = The number of signals the microprocessor must send to the outside world

By definition, the number of logical outputs ($L_{out}$) will not change by time multiplexing. It is the
sum of $N_{out}$ and $N_{bi}$ only if time multiplexing is never used to multiplex more than one output
signal onto one output or bidirectional pin. In other words:

If time multiplexing is not used: $L_{out} = N_{out} + N_{bi}$

If time multiplexing is used: $L_{out} > N_{out} + N_{bi}$

Using this term logical output ($L_{out}$), I can rewrite Equation 5.4.3 as:

$$N_{vdd} = \frac{L_{out}}{M} \tag{5.4.4}$$

The number of logical outputs $L_{out}$ is used in Equation 5.4.4 to emphasize the fact that the
number of power pins needed ($2 \times N_{vdd}$) will not change by time multiplexing. The SPUR CPU has
approximately 120 logical outputs. After careful consideration of the pin inductance, switching
frequency, and worst case loading, the SPUR circuit designers [Jeo88] decided that $M = 6$ is
sufficient. The SPUR CPU therefore has 20 pairs of $V_{dd}$ and GND pins.
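
The pin budget arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are invented here, rounding up for a partially filled group of M outputs is an assumption, and the input-pin count and 160-pin package in the usage lines are made-up numbers. Only the 120 logical outputs and M = 6 come from the text.

```python
import math

def vdd_pins(logical_outputs: int, group_size: int) -> int:
    """Number of Vdd pins when one Vdd/GND pair serves each group of M
    logical outputs. Rounding up for a partial group is an assumption."""
    return math.ceil(logical_outputs / group_size)

def pins_fit(n_in: int, n_out: int, n_bi: int, n_vdd: int, n_available: int) -> bool:
    """Pin-count constraint: signal pins plus 2 * Nvdd power pins must fit."""
    return n_in + n_out + n_bi + 2 * n_vdd <= n_available

# SPUR CPU figures from the text: about 120 logical outputs, M = 6.
spur_pairs = vdd_pins(120, 6)
print(spur_pairs)  # 20

# Hypothetical comparison: 2:1 time multiplexing halves the output pins,
# but the logical outputs (and so the power pins) stay the same.
print(pins_fit(n_in=50, n_out=120, n_bi=0, n_vdd=spur_pairs, n_available=160))  # False
print(pins_fit(n_in=50, n_out=60, n_bi=0, n_vdd=spur_pairs, n_available=160))   # True
```

In the second call the saving comes entirely from the signal pins; the 40 power and ground pins are unchanged.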
The reason why the number of power pins needed cannot be reduced by time multiplexing
also applies when considering the power required to drive the output pins ($Power_{out\_pin}$). Since
$Power_{out\_pin}$ cannot be reduced by time multiplexing, $Power_{out\_pin}$ in general is not a function of
($N_{out} + N_{bi}$). The power required to drive the output pins is also a function of the number of
logical outputs:

$$Power_{out\_pin} = \mathrm{func}(F_{switch}, L_{out}) \tag{5.4.5}$$

In conclusion, while time multiplexing can reduce the number of physical pins ($N_{in}$, $N_{out}$,
$N_{bi}$), it does not reduce the number of power and ground pins. It also does not reduce the power
required to drive the output pins. Furthermore, time multiplexing also increases the complexity of
the chip.

Figure 5-4-2 shows the simple SPUR CPU off-chip communication strategy that does not
involve time multiplexing. In pursuing this simplest solution, a dedicated set of 32 pins is
allocated to the coprocessor interface even though coprocessor instructions occur only rarely.


Figure 5-4-2 SPUR CPU Off-Chip Communication

The SPUR CPU designer took the easiest way out and picked the simplest solution. This solution
allocates a separate set of pins for data, address, and the coprocessor interface. The coprocessor
interface broadcasts every instruction the SPUR CPU receives from its internal instruction cache to
the coprocessor.
Before accepting any solution, the microarchitect must make sure the solution meets the
performance requirements and is within the resources and complexity constraints. Answering questions
such as these below helps make the decision:

•  Are  there  enough  pins  for  this  solution? 

•  How  much  power  does  it  take  to  drive  all  the  output  and  bidirectional  pins? 

•  What kind of performance will this solution give?

•  How  complex  is  it  to  debug  and  implement  this  solution? 

•  Is  this  the  most  efficient  way  to  use  the  limited  pin  resource? 

While you can give quantitative answers to the first two questions, it is hard to give absolute
answers to the last three questions. It is much easier to give relative answers by comparing different
solutions. Furthermore, in order to answer all these questions, you must make some assumptions
about clocking.

5.4.2.  Pipeline  and  Clocking 

Clocking schemes must be studied together with pipelining because the longer the
pipeline (that is, the more pipe stages for each instruction), the shorter the potential clock cycle. Most
useful work is done during the high time of the clock in MOS technology. Therefore the number
of clock phases per cycle together with the number of pipeline stages for each instruction
determines the number of time slots in which useful work can be done. This is illustrated in Figure
5-4-3, where the SPUR CPU pipeline and clocking scheme are used as an example. In general, the
more explicit time slots there are, the easier it is to design dynamic logic [Kon85]. Unfortunately, more
explicit time slots also mean that more time will be wasted between time slots because explicit
non-overlap dead time must be placed between phases to guard against clock skew problems (see
Figure 2-3-3).
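
The relationship between pipe stages, clock phases, and time slots can be made concrete with a short Python sketch. The stage names and the four phases are the SPUR CPU's. The function name and the choice to number the slots from 1 are illustrative assumptions.

```python
from itertools import product

STAGES = ("Ifet", "Exec", "Mem", "Wr")  # SPUR CPU pipeline stages
PHASES = (1, 2, 3, 4)                   # SPUR CPU clock phases

def time_slots(stages, phases):
    """List (slot number, stage, phase) for every slot one instruction passes through."""
    return [(number, stage, phase)
            for number, (stage, phase) in enumerate(product(stages, phases), start=1)]

slots = time_slots(STAGES, PHASES)
print(len(slots))  # 16
print(slots[4])    # (5, 'Exec', 1)
```

Adding a pipe stage or a clock phase adds time slots, and each added phase boundary also adds non-overlap dead time.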

There is a subtle difference between designing a traditional pipeline and a RISC-style
pipeline that handles integer instructions only. In designing a traditional pipeline, the main concern is
The SPUR CPU 4-Stage Pipeline:

| Ifet | Exec | Mem  | Wr   |
       | Ifet | Exec | Mem  | Wr   |

The SPUR CPU 4-Phase Clock: [waveform diagram]

The SPUR CPU pipeline has 16 time slots:

| 1 : 2 : 3 : 4 | 5 : 6 : 7 : 8 | 9 : 10 : 11 : 12 | 13 : 14 : 15 : 16 |


Figure  5-4-3  Pipeline  and  Clocking 

The SPUR CPU uses a 4-stage (Ifet, Exec, Mem, Wr) pipeline: each instruction takes four cycles
to finish. The SPUR CPU also uses a 4-phase clock: each cycle is divided into four phases. The
4-stage pipeline together with the 4-phase clock provides 16 time slots to do useful work.


how to schedule the issue of instructions such that there will not be any structural conflict
[Kog81]. An example of a structural conflict is two instructions trying to use the ALU during the
same cycle. On the other hand, a RISC processor that only supports integer operations can
execute the simple instructions in a very uniform manner. This makes structural conflicts very easy
to detect and eliminate. Therefore, the main concern in designing a RISC-style pipeline is not
instruction scheduling; the main concern is what kind of resources are needed to eliminate all
structural conflicts such that an instruction can be issued every cycle. Since most if not all structural
conflicts can be eliminated from a RISC-style pipeline, there are only two things left that can
degrade a RISC-style pipeline’s efficiency: branching and data dependency. Important questions
the microarchitect must keep in mind when designing a RISC-style pipeline are:

•  The cost of branching: how many cycles are wasted?
Figure 5-4-4 M-Stage Pipeline

This is an ideal M-stage pipeline where any one of the N types of instructions can be issued at
every cycle. At any given cycle, this pipeline can be in any one of the $N^M$ possible states.
All these states must be controlled properly. Therefore, for the same number of instruction types,
a longer pipeline also needs more complex control, which may offset the advantage of the longer
pipeline.


•  The cost of data dependency: how many cycles will an instruction have to wait for data?

•  For a given pipeline length, which clocking scheme achieves the best cycle time?

•  What are the types and complexity of the macro-modules needed to implement a pipeline
that will allow any instruction to be issued at every cycle?

The first two questions are Computer Science problems and a lot of work has been done on
them [Kog81] [McH86]. The last two are Electrical Engineering problems and not
much work has been done on them. I think the best way to answer these last two questions is to use the
following procedure:

(1)  Divide the instruction set into N types of instructions. Register-register operations
(Reg_Reg) and load operations (Load) are two examples of instruction types in the
SPUR CPU.
Figure 5-4-5 Operations for Several SPUR Instruction Types


The operations for Reg_Reg, Load, and Cmp_Branch type instructions are listed in the
(macro-module : task) format. All instructions within a type must require the same macro-module to
perform the same task during the same stage of the pipeline. For example, for all Reg_Reg
instructions, during the Exec stage, the Operand Supplier must supply the operands and the Functional
Unit must operate on the operands.


(2)  List  the  steps  it  takes  to  execute  each  type  of  instruction.  This  will  give  the  microarchitect 
ideas  about  what  pipe  stages  are  needed  to  execute  all  types  of  instructions  uniformly. 

(3)  Construct a uniform M-stage pipeline that has the potential to execute one instruction
per cycle (Figure 5-4-4). Examples of pipe stages in the SPUR CPU are: instruction fetch
(Ifet), register read and execution (Exec), memory access (Mem), and register write (Wr).

(4)  For each type of instruction, list the operations for each stage of the pipeline. They may
be given informally at first but eventually the designer must specify the operations in
terms of what macro-module is needed to perform what task in the (macro-module : task)
format. Figure 5-4-5 shows several examples.

(5)  Based on the results of Step 4, construct the list of necessary macro-modules. For
example, by examining Figure 5-4-5, one can construct this list of macro-modules for
Reg_Reg, Load, and Cmp_Branch: I-Unit, PC-Logic, CC_Interface, Operand Supplier,
and Functional Unit.

(6)  Construct the list of operations each macro-module must provide by using the result of
Step 4 and examining the possible pipeline states. For example, Figure 5-4-6
shows the Operand Supplier must provide Read and Write operations every cycle.

(7)  After careful examination of each macro-module’s list of operations, propose a clocking
scheme. For example, since the easiest way to implement a large register file in CMOS is
to precharge the bit lines before read and write, the Operand Supplier must perform (1)
read, (2) precharge for write, (3) write, and (4) precharge for read within a cycle. The
4-phase clock is a natural clocking scheme for these four distinct events.
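
Steps 4 through 6 of this procedure lend themselves to a small script. The following Python sketch is illustrative only. The execution model table is a guess at the kind of (macro-module : task) entries the procedure produces, restricted to two instruction types. Only the Operand Supplier Read and Write entries are taken from the text, and the task names "Fetch", "Operate", and "Access" are invented.

```python
from itertools import product

STAGES = ("Ifet", "Exec", "Mem", "Wr")

# Step 4 (illustrative): operations per instruction type and pipe stage,
# written as (macro-module, task) pairs. Entries other than the Operand
# Supplier Read/Write are assumptions made for this sketch.
EXECUTION_MODEL = {
    "Reg_Reg": {
        "Ifet": [("I-Unit", "Fetch")],
        "Exec": [("Operand Supplier", "Read"), ("Functional Unit", "Operate")],
        "Mem": [],
        "Wr": [("Operand Supplier", "Write")],
    },
    "Load": {
        "Ifet": [("I-Unit", "Fetch")],
        "Exec": [("Operand Supplier", "Read"), ("Functional Unit", "Operate")],
        "Mem": [("CC_Interface", "Access")],
        "Wr": [("Operand Supplier", "Write")],
    },
}

def macro_modules(model):
    """Step 5: every macro-module named anywhere in the execution model."""
    return sorted({module
                   for stages in model.values()
                   for operations in stages.values()
                   for module, _task in operations})

def per_cycle_operations(model, stages=STAGES):
    """Step 6: walk all N**M pipeline states and record, for each
    macro-module, the largest set of tasks demanded within one cycle."""
    worst = {module: frozenset() for module in macro_modules(model)}
    state_count = 0
    for state in product(model, repeat=len(stages)):  # one type per stage
        state_count += 1
        demanded = {}
        for stage, instruction_type in zip(stages, state):
            for module, task in model[instruction_type][stage]:
                demanded.setdefault(module, set()).add(task)
        for module, tasks in demanded.items():
            if len(tasks) > len(worst[module]):
                worst[module] = frozenset(tasks)
    return worst, state_count

worst_case, states = per_cycle_operations(EXECUTION_MODEL)
print(states)                                  # 16 states for N = 2, M = 4
print(sorted(worst_case["Operand Supplier"]))  # ['Read', 'Write']
```

With these assumed entries the script reports that the Operand Supplier must perform both Read and Write within one cycle, which is the requirement that motivates the clocking choice in Step 7.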

In the above procedure, all instructions of the same type will end up having the same
execution model (Figure 5-4-5) and make the same contributions to the list of macro-modules and their
Figure 5-4-6 Potential Structural Conflict

This is a generic diagram for all the pipeline states that have Reg_Reg’s Exec stage and Load’s Wr
stage. Reg_Reg’s Exec stage requires the (Operand Supplier : Read) operation and Load’s Wr
stage requires the (Operand Supplier : Write) operation. The Operand Supplier must be able to perform
Read and Write within a cycle in order to prevent structural conflict. For simplicity, only
relevant operations are shown in this figure.
lists of operations. Since the set of macro-modules and their lists of operations define the level of
abstraction, the division of instructions into types (Step 1) will determine the level of abstraction
at which the microarchitect looks at the proposed microarchitecture. For example, if in Step 1 we divide
the instructions into high level types such as Reg_Reg, then we will be listing the operations for
macro-modules such as Functional Unit in Step 4. On the other hand, if in Step 1 we divide the
instructions into low level types such as Add and Shift, then we will be listing the operations for
micro-modules such as ALU and Shifter in Step 4.

At the lowest level, every instruction is a separate type because each instruction must
behave differently in some way from the others. This low level of detail is probably not
necessary if one only wants to study pipeline and clocking alternatives. The microarchitect should
therefore pick a level of abstraction just low enough to show the characteristics of different
pipeline and clocking alternatives but not so low that it requires a large number of modules each with
a list of very specific operations. The only way to find out the proper level of abstraction is by
iterating Step 1 through Step 6 of this section. Another reason why iteration may be necessary is
that the microarchitect may find out in Step 6 that the list of operations for a macro-module is too
long and thus the macro-module is too complex. He may then have to go back to Step 4 and
either assign some of its operations to other existing modules or create some new macro-modules.
Since this is an iterative process and at each iteration the designer may have to examine as many
as $N^M$ states, CAD tools must be developed to ease the designer’s task.

5.4.3.  The  Abstract  Model  of  the  Microarchitecture 

The procedure described in the last section will not only create a pipeline and clocking scheme
but will also select a set of macro-modules (Step 4). For example, if this procedure is used for
the SPUR CPU, the set of macro-modules selected at this point will be: I-Unit, Operand Supplier,
Functional Unit, CC_Interface, PC-Logic, and Special Registers. In order to complete the abstract
model of the proposed microarchitecture, these macro-modules must be connected. A systematic
way to propose a connection scheme is to examine the interaction between the macro-modules
Figure 5-4-7 On-Chip Interaction


Each node in this graph represents a macro-module. Each arc in this graph represents a set of
signals that must be sent from one macro-module to another. The number associated with each arc
is the number of signals in the set and the name of the bus assigned to the arc(s) is in parentheses.
This graph is constructed by looking at one macro-module at a time and considering to which
macro-modules it must send its output.


graphically. 

Figure 5-4-7 is a directed graph that shows the interaction between the various
macro-modules of the SPUR CPU. The simplest connection scheme can be derived from this graph by
assigning a signal bus to each arc. Such a scheme, however, will also be very expensive in terms
of resources. A better approach is to group some of the arcs together and assign them a single bus.
For example, in Figure 5-4-7, busResult is assigned to all output arcs of the Functional Unit and
some input arcs to the Operand Supplier. Furthermore, busL is assigned to both arcs connecting
to the Data Pins. Figure 5-4-7 is the logical bus structure. The physical bus structure, which is
shown in Figure 5-4-8, is slightly different.
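
The saving from grouping arcs onto shared busses can be estimated with a rough Python sketch. The arc list below is hypothetical: the module and bus names come from the text, but the arc endpoints and the 32-signal widths are assumptions, since the numbers in Figure 5-4-7 are not reproduced here.

```python
from collections import defaultdict

# (source, destination, number of signals, assigned bus); hypothetical arcs.
ARCS = [
    ("Functional Unit", "Operand Supplier", 32, "busResult"),
    ("Functional Unit", "PC-Logic", 32, "busResult"),
    ("Functional Unit", "Special Registers", 32, "busResult"),
    ("Operand Supplier", "Data Pins", 32, "busL"),
    ("Data Pins", "Operand Supplier", 32, "busL"),
]

def wires_one_bus_per_arc(arcs):
    """Simplest scheme: every arc gets its own signal bus."""
    return sum(width for _src, _dst, width, _bus in arcs)

def wires_shared_busses(arcs):
    """Grouped scheme: arcs assigned to the same bus share its wires,
    so each bus is as wide as its widest arc."""
    widest = defaultdict(int)
    for _src, _dst, width, bus in arcs:
        widest[bus] = max(widest[bus], width)
    return sum(widest.values())

print(wires_one_bus_per_arc(ARCS))  # 160
print(wires_shared_busses(ARCS))    # 64
```

The sketch counts wires only. Arcs that share a bus must use it at different times, which is a clocking question the count does not capture.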
Figure 5-4-8 SPUR CPU Abstract Model

In Figure 5-4-7, busResult is assigned to all output arcs of the Functional Unit and some input
arcs to the Operand Supplier. When implementation is taken into account, busResult is broken
into busS and busD to reduce the bus capacitance that each module has to drive. PC Logic and
Special Registers must first send the values onto busS which then drives busD. In general, asking
one bus to drive another can cause a lot of delay. It is acceptable here because PC Logic and
Special Registers put values onto busS during $\phi_2$ and busD does not have to be driven until the
following $\phi_4$, not in the same phase.


Figure 5-4-8 is the block diagram of the abstract model of the microarchitecture we have
proposed thus far. In order to verify and evaluate this proposed microarchitecture, it must be
modeled in a hardware description language. This is called the abstract model and the major issues
this model will be used to study are: (1) off-chip communication, and (2) pipeline and clocking.
In modeling, it is of the utmost importance to keep in mind the things one wants the model to examine
and keep the model just complex enough to do the job. Therefore, the Instruction Unit is not
included in this model and instruction types instead of individual instructions are used. Ideally,
for each proposed microarchitecture at the abstract level, the microarchitect should use an
abstract model such as Figure 5-4-8 to answer all the unanswered questions before moving on to
the detail level of the microarchitecture. Figure 5-4-9 reviews how the abstract model is
constructed.
Figure 5-4-9 Building the Abstract Model

This is a flow chart of the construction of the abstract model (Figure 5-4-8). In the beginning,
there are the instruction set and the interface specification. A pipeline scheme is proposed by
looking at the instruction set. The execution model for each type of instruction with respect to the
pipeline gives a list of the macro-modules required and their operations. The list of operations for
each module can then be used to determine the clocking scheme. The clocking scheme together
with the interface specification will then determine the off-chip communication strategy.
Finally, on-chip interaction must be taken into account in order to connect the set of macro-modules
together.


5.5.  The Expansion Stage of Microarchitectural Design

The  previous  section  shows  how  to  create,  specify,  and  examine  a  microarchitecture  at  the 
abstract  level.  The  resulting  abstract  model  consists  of  a  set  of  macro-modules  controlled  by  a 
high  level  controller.  The  goal  of  the  expansion  stage  is  to  obtain  a  detailed  specification  of  the 
microarchitecture  by  expanding  the  macro-modules  and  the  high  level  controller. 

5.5.1.  Micro-Modules  Selection  and  Resources  Allocation 

The functionality of the macro-modules is defined during the abstract stage. The
microarchitect has many options to achieve this functionality by selecting different sets of
micro-modules. For example, Figure 5-5-1 is one possible option the microarchitect may use to expand
the macro-module Functional Unit. Another option is to use a 32-bit barrel shifter instead of the
EXT_INS and SHIFTER. The microarchitect must evaluate the performance, resources, and
complexity tradeoffs quantitatively when considering these options. This is similar to the problem
studied in Chapter 4 where the microarchitect has many options in what features to include in the
CPU. The systematic approach shown in Section 4.7.3, which was based on the performance,
resources, and complexity tradeoffs, can be applied here.

Figure 5-5-2 illustrates graphically the performance, resources, and complexity tradeoffs. This is
similar to Figure 4-7-2 except that in Figure 4-7-2, the options on the vertical axis correspond to
different features that can be included in the CPU. Here in Figure 5-5-2, the options on the
vertical axis correspond to different ways a macro-module can be expanded. For example, suppose
we are considering the tradeoffs in building the macro-module Instruction Unit. Option 1 could
be a simple (low in complexity), low cost in resources, and low performance direct-mapped cache.
Option 2 is similar but with a bigger cache size (more resources). Option 3 can be a cache with
prefetching (more complex) such that the cache size can be smaller (fewer resources) and still
achieve the same performance as Option 2. Option 4 can be a set associative cache
with prefetch, which gives higher performance than Option 3 but also requires more resources and
a higher degree of complexity.
Figure 5-5-1 Micro-Modules Selection for the SPUR CPU Functional Unit

In this example, the macro-module Functional Unit is expanded into micro-modules EXT_INS
(byte extractor/inserter), SHIFTER, ALU, BRANCH COND, and BUSSTOD.
Figure 5-5-2 Performance, Resources, and Complexity Tradeoffs


The options, which correspond to different ways the macro-modules can be expanded, are placed
in increasing complexity on the vertical axis. The performance and resources needed for these
options are plotted on the horizontal axes. The performance requirement and resources available
for each macro-module place the "acceptable performance" bound on the performance axis and
the "resource available" bound on the resources axis.


Each macro-module has its own minimum performance requirement. This requirement is a
direct result of the overall performance goal and can be obtained during simulation at the abstract
stage. The performance requirement for each macro-module places an "acceptable performance"
bound on the performance axis in Figure 5-5-2. Given this performance requirement, the
microarchitect must allocate enough resources such that an option that has acceptable performance and
complexity can be built. Below is an example of how we can apply the systematic approach
discussed in Section 4.7.3 to select an option to expand the macro-modules. This is very similar to
the example shown in Section 4.7.3 except that there the options are what features to include in
the SPUR CPU.

(1)  Make an educated guess on how many resources you are willing to spend on this
macro-module. This places a "resource available" bound on the resources axis in Figure 5-5-2.

(2)  Within this bound, pick the simplest option available.

(3)  If this option’s performance is within the acceptable range, then mission accomplished.
Otherwise, go to Step 4.

(4)  If there are any other options within the resource bound, pick the next more complex
option and go back to Step 3. Otherwise go to Step 1 and increase the resources available
bound.

Using Figure 5-5-2 as an example, Step 2 of this procedure will pick Option 1. However, in
Step 3, we will find out Option 1’s performance is below the acceptable range. In Step 4, Option
2 is not chosen because it uses more resources than available. Option 3 will be chosen
because it is less complex than Option 4. Finally, when we get back to Step 3, we will find out
Option 3’s performance is acceptable.
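
The four-step selection procedure can be written down as a short loop. The Python sketch below is one possible reading of it. The `Option` record, the function name, the fixed `bound_increment` used when returning to Step 1, and all numeric values are assumptions chosen only to reproduce the ordering described for Figure 5-5-2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    name: str
    complexity: float
    resources: float
    performance: float

def select_option(options, resource_bound, acceptable_performance, bound_increment):
    """Return (option, final resource bound) following Steps 1 to 4."""
    if bound_increment <= 0:
        raise ValueError("bound_increment must be positive")
    if not any(o.performance >= acceptable_performance for o in options):
        raise ValueError("no option meets the performance requirement")
    by_complexity = sorted(options, key=lambda o: o.complexity)
    while True:
        for option in by_complexity:                  # Steps 2 and 4: simplest first
            if option.resources > resource_bound:
                continue                              # outside the resource bound
            if option.performance >= acceptable_performance:
                return option, resource_bound         # Step 3: acceptable
        resource_bound += bound_increment             # back to Step 1

# Hypothetical numbers in arbitrary units, mirroring the four cache options.
OPTIONS = [
    Option("Option 1", complexity=1, resources=2, performance=3),
    Option("Option 2", complexity=2, resources=6, performance=5),
    Option("Option 3", complexity=3, resources=3, performance=5),
    Option("Option 4", complexity=4, resources=4, performance=7),
]

chosen, bound = select_option(OPTIONS, resource_bound=5,
                              acceptable_performance=4, bound_increment=1)
print(chosen.name)  # Option 3
```

With these values Option 1 fails on performance, Option 2 exceeds the resource bound, and Option 3 is accepted before the more complex Option 4 is considered.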

5.5.2.  On-Chip Interaction and Second Order Effects

At the abstract stage of the design, on-chip interaction concerns the interaction among the
various macro-modules. Section 5.4 showed how this problem can be solved by connecting the
macro-modules via signal busses embedded in the datapath. At the expansion stage, on-chip


Figure  5-5-3  The  SPUR  CPU  Control  Strategy 

The SPUR CPU is controlled by three modules: Trap Logic detects all internal and external
unusual conditions, I-Unit Controller controls the instruction unit that delivers instructions, and
E-Unit Controller decodes every instruction it receives into control signals. Whenever the Trap
Logic detects an unusual condition, all it has to do is to send a signal (trapRequest) to the I-Unit
Controller. The I-Unit Controller then delivers the proper internal instructions that can handle the
trap to the E-Unit Controller.
interaction concerns the control of the micro-modules. We must also take into account second
order effects (such as trap detection) that are ignored during the abstract stage. Second order effects are
very important to the controller design during the expansion stage. A systematic approach to this
problem can be summarized in three words: isolation, specialization, and optimization.

Isolation 

Isolate  different  aspects  of  the  control  function  into  sub-functions  that  have  minimum 
interaction  among  them. 

Specialization 

Design  specialized  modules  for  the  sub-functions. 

Optimization 

Finally,  optimize  the  specialized  modules. 

In order to illustrate this three-step approach, I will use the SPUR CPU as an example. The
SPUR CPU control strategy is shown in Figure 5-5-3. The control function of the SPUR CPU can
be isolated into three sub-functions. These three sub-functions and their respective specialized
modules are:

•  The control of the Instruction Unit (I-Unit) that delivers the instruction. This is handled by
the I-Unit Controller.

•  The control of the Execution Unit (E-Unit) that executes the instruction. This is handled by
the E-Unit Controller.

•  The  detection  of  internal  and  external  unusual  conditions.  This  is  handled  by  the  Trap 
Logic. 

One major optimization we performed in the SPUR CPU is the use of internal instructions
to further reduce the on-chip interaction. As illustrated in Figure 5-5-3, the Trap Logic asserts the
trapRequest signal whenever it detects any unusual condition. Upon receiving the trapRequest
signal, the I-Unit Controller will deliver the internal instructions that handle the trap to the E-Unit
Controller. This optimization therefore reduces the interaction between the Trap Logic and the
I-Unit Controller to one signal: trapRequest. This optimization also limits the interaction between
the I-Unit Controller and the E-Unit Controller to normal and internal instructions only.

The I-Unit Controller consists of two finite state machines (Figure 2-2-1) and the Trap
Logic consists of five random logic blocks (Figure 2-4-1). Their designs are simple finite state
machine and random logic design problems and were discussed earlier in Section 2.2 and
Section 2.4, respectively. Neither the I-Unit Controller nor the Trap Logic involves further on-chip
interaction considerations. The E-Unit Controller, on the other hand, must distribute the control
information it generates to the micro-modules in the datapath. The E-Unit Controller does not
have to distinguish internal instructions from normal instructions. It simply generates a set of
control signals for each instruction it receives. The E-Unit Controller is therefore just a
combinational logic block. The rest of this section will discuss strategies that can be used to reduce the
interaction between the E-Unit Controller and the datapath.


Figure  5-5-4  The  E-Unit  Controller  Bus  Structure 

The Master Control is divided into M stages where M is the length of the pipeline. Each stage
generates one set of high level control signals that controls one stage of the pipeline. High level
control signals are distributed via the high level control signal bus. The local decoding logic
blocks then generate the low level control signals that control the datapath.
Figure 5-5-5 Distribution of Control Information


The high level control signals are decoded into low level control signals by very simple
combinational logic. Most outputs of the combinational logic must be ANDed with one of the clock
phases (1, 2, 3, or 4) before being used by the datapath. Outputs that do not have to be
ANDed with any clock phase are buffered (0). Furthermore, in order to reduce the load on
the clock generator, the clock signals are also buffered (5 and 6) before they are fed into the
datapath.


Signal busses are used during the abstract stage to reduce the interactions among various
macro-modules. Similarly, signal busses can also be used here to reduce the interaction between
the E-Unit Controller and the datapath. This is illustrated in Figure 5-5-4. The E-Unit Controller is
divided into two parts: the Master Control and Local Decoding Logic. The master control is
located far away from the datapath but the local decoding logic blocks are placed right next to the
datapath. A signal bus is used to distribute the high level control signals generated by the master
control to the local decoding logic blocks. This simplifies the on-chip interaction because the
number of high level control signals is relatively small compared to the number of low level
control signals. This reduction in number is due to sharing: each high level signal is used by more
than one local decoding logic block.
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The  master  control  consists  of  M  stages,  where  M  is  the  length  of  the  pipeline  (see  Figure 
5-5-4).  The  first  stage  decodes  the  instruction  (mainly  the  opcode)  into  the  first  set  of  high  level 
control  signals.  The  other  stages  uses  the  outputs  of  their  previous  stage  as  inputs.  The  major 
fimction  of  these  latter  stages  is  to  delay  the  control  signals  by  one  cycle.  One  may  also  put  some 
simple  logic  in  the  latter  stages  to  combine  some  of  their  inputs.  This  is  especially  useful  when 
instmctions  require  different  control  signals  at  early  stages  of  the  pipeline  but  require  the  same 
control  signal  at  latter  stages. 
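
As a rough illustration of the staged master control described above, the following Python sketch
models stage 1 as a decoder and every later stage as a one-cycle delay register. The opcode names,
the signal names, and the DECODE table are hypothetical and are not taken from the SPUR CPU.

```python
from collections import deque

# Hypothetical first-stage decode table: opcode -> set of high level control signals.
DECODE = {
    "add": {"alu_add", "reg_write"},
    "load": {"alu_add", "mem_read", "reg_write"},
    "nop": set(),
}


class MasterControl:
    """M-stage master control: stage 1 decodes, stages 2..M delay by one cycle each."""

    def __init__(self, m):
        # One slot per pipeline stage; all stages start with no signals asserted.
        self.stages = deque([frozenset()] * m, maxlen=m)

    def clock(self, opcode):
        """Advance one cycle and return the control signal set seen by each stage."""
        # The newest instruction enters stage 1; the oldest falls off the far end.
        self.stages.appendleft(frozenset(DECODE[opcode]))
        return list(self.stages)
```

Calling clock("add") and then clock("load") on a four-stage instance leaves the signals for "load"
in stage 1 and the signals for "add" in stage 2, which is the one-cycle delay the text describes.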

The local decoding logic block consists of a block of simple combinational logic, a clock
signal bus, and some AND, NOT, and BUF gates between the clock bus and the datapath. This is
illustrated in Figure 5-5-5. Since CMOS logic gates can also serve as buffers, the simple
combinational logic block also serves as the intermediate stage of the multi-stage control information
distribution network between the E-Unit Controller and the datapath. Similarly, the AND, NOT,
and BUF gates between the clock bus and the datapath serve as the final stage. The sizes of the
buffers at the final stage should be selected according to the RC loading of the low level control
signals and clock wires such that all the buffers will drive their outputs within approximately the
same delay.

The control strategy shown in Figure 5-5-3, Figure 5-5-4, and Figure 5-5-5 was used in the
SPUR CPU. This strategy resulted in a much more compact controller for the SPUR CPU (see
Figure 1-4-3) than the controller for SOAR (see Figure 1-4-2). One important question concerning
this strategy is how the designer should divide the high and low level decoding. There
must be a good balance between the number of high level control signals and the complexity of
the local decoding logic. I do not have a satisfactory answer, but the SPUR CPU implementation
does provide some insight into this question. In the SPUR CPU, the combinational logic in the
local decoding logic is single level logic and consists mostly of OR gates. This indicates that this
is a generalized PLA problem with the local decoding logic as the OR plane and the high level
control signals as the product terms. In a more general view, one may also consider this as a
multiple level logic optimization problem with the local decoding logic as the last logic level. This,
however, is a more challenging problem for the CAD tool designer than the normal multiple
level logic optimization problem for the following reasons:

• The area available for the combinational logic within the local decoding logic block is
restricted by the spacing between the low level control signals.

•  The  optimum  size  of  the  simple  combinational  logic  gates  and  the  final  buffers  (see  Figure 
5-5-5)  depends  on  the  RC  loading  of  the  low  level  control  signals. 
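
The OR-plane view of the local decoding logic described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: the signal names, the LOCAL_DECODE
table, and the phase assignments are hypothetical, and the sketch ignores the area and RC loading
constraints listed above.

```python
# Hypothetical OR plane: each low level signal is the OR of some high level signals,
# optionally gated (ANDed) with one clock phase; a phase of None means "buffered only".
LOCAL_DECODE = {
    "alu_latch_a": ({"alu_add", "alu_sub"}, 1),
    "regfile_write": ({"reg_write"}, 3),
    "bus_enable": ({"mem_read", "mem_write"}, None),
}


def local_decode(high_level, phase):
    """Return the set of low level control signals asserted during `phase`."""
    asserted = set()
    for low_signal, (terms, gate_phase) in LOCAL_DECODE.items():
        or_output = bool(terms & high_level)  # single level OR of the inputs
        clock_ok = gate_phase is None or gate_phase == phase
        if or_output and clock_ok:
            asserted.add(low_signal)
    return asserted
```

Each high level signal may appear in several rows of the table, which reflects the sharing that
keeps the high level control signal bus narrow.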

5.5.3. The Expanded Model of the Microarchitecture

The expansion stage of the microarchitectural design process takes into account micro-module
selection, resource allocation, and on-chip interaction. The result is the expanded model
of the proposed microarchitecture. This is illustrated in Figure 5-5-6. The SPUR CPU behavioral
model (Figure 3-2-2) can be considered as the expanded model of the SPUR CPU. A complete
expanded model is time consuming to build. Therefore, the microarchitect should not start
building the expanded model until he has done enough preliminary investigation using the abstract
model. The expanded model, however, can provide much better insights into the performance,
resources, and complexity tradeoffs of the proposed microarchitecture than the abstract model.

[Figure 5-5-6 boxes: Abstract Model, Unusual Conditions, Resource, Micro-Modules Selection,
Control Strategy, Expanded Model]

Figure 5-5-6 Building the Expanded Model

This is a flow chart on the construction of the expanded model. At this stage of the design, the
microarchitect has an abstract model of the microarchitecture. He must also consider all the
unusual conditions that can affect the normal operation of the microarchitecture. Macro-modules
in the abstract model are then expanded into micro-modules according to the resources available
and the functionality of the macro-modules. The microarchitect must also consider on-chip
interaction carefully in order to derive a control strategy that controls the micro-modules and
handles all the unusual conditions.

5.5.3.1. Using the Expanded Model for Performance Estimation

In Chapter 4, I have shown that the performance of a microarchitecture should be measured
in terms of the T × I × C product. The expanded model of the CPU can be used to estimate C, the
average number of cycles per instruction, and T, the cycle time, more accurately than the
abstract model.

For example, in the N.2 environment where the expanded model of the SPUR CPU is
simulated, test programs can be compiled and loaded into simulated memory. The SPUR CPU
expanded model can then be simulated using the test programs that reside in the simulated
memory. Since the expanded model models the details of the microarchitecture, it can be used to
measure the number of cycles the CPU takes to execute certain test programs (I × C) accurately.
Unfortunately, the simulator for the expanded model can be relatively slow. For example, the
SPUR CPU expanded model takes an average of 2 SUN3/160 CPU seconds to simulate each
instruction! This severely limits the size of test programs that can be simulated. However, if
the test programs have the proper mix of instructions, the microarchitect can still calculate the
average number of cycles per instruction (C) accurately by dividing the cycle count (I × C) he
measured by the number of instructions (I) in the test program.

In the N.2 environment, the SPUR CPU expanded model can be used to estimate the
microarchitecture’s cycle time in two different ways:

• The N.2 simulator can perform simple timing analysis if explicit timing information is
provided for each micro-module, and

• The N.2 simulator can trace the significant sequential events within each clock phase. This
list of events will give the microarchitect some insight into the duration of each clock phase.

Once the microarchitect has estimated the cycle time (T), and has calculated the average number
of cycles per instruction (C), the last thing he has to do before he can predict the CPU’s
performance accurately is to get an estimate of the number of instructions the CPU takes to execute
certain large benchmarks (I). This, of course, can be measured from the instruction level simulator.
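
The arithmetic described above can be sketched in a few lines of Python. The function names and
the cycle and instruction counts are illustrative assumptions, not SPUR measurements; only the
100 ns cycle time is the figure quoted in the abstract.

```python
def cycles_per_instruction(cycle_count, instruction_count):
    """C = (I x C) / I, measured on a test program with a representative mix."""
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    return cycle_count / instruction_count


def execution_time_ns(cycle_time_ns, benchmark_instructions, cpi):
    """Estimated run time T x I x C, in nanoseconds."""
    return cycle_time_ns * benchmark_instructions * cpi


# Hypothetical test program: 1500 cycles measured on the expanded model for 1000 instructions.
cpi = cycles_per_instruction(cycle_count=1500, instruction_count=1000)

# Hypothetical benchmark of 2,000,000 instructions counted by the instruction level simulator.
time_ns = execution_time_ns(100, 2_000_000, cpi)
```

The small test program supplies C, the timing analysis supplies T, and the much faster
instruction level simulator supplies I for the large benchmark.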

5.5.3.2.  Using  the  Expanded  Model  for  Resources  Estimation 

As  discussed  in  Section  3.2.1,  the  SPUR  CPU  behavioral  model  (the  expanded  model)  is  a 
composite  module  that  consists  of  many  modules  connected  by  a  top  level  topology  file.  The 
dimensions  of  each  module  can  be  estimated  either  based  on  previous  experience  or  better  yet 
based  on  the  results  of  resource  allocation  analysis  illustrated  in  Figure  5-5-2.  The  dimensions  of 
each  module  can  then  be  added  to  the  topology  file  as  comments.  These  dimensions  together  with 
the  connection  information  in  the  topology  file  can  be  used  by  a  designer  assisted  by  CAD  tools 
to create a tentative floor plan that gives a rough estimate of chip area. Floor planning is
discussed in more detail in Section 5.6.

The expanded model can also be used for power estimation. In CMOS, where static power
consumption  is  low,  most  of  the  power  will  be  consumed  in  driving  busses  and  off-chip  output 
pads.  All  busses  and  output  pads  are  simply  internal  and  external  connections  in  the  top  level 
topology  file  that  describes  the  expanded  model.  Furthermore,  if  power  consumption  of  certain 
modules  is  significant  compared  to  busses  and  pads,  this  information  can  also  be  added  to  the 
topology  file  as  comments.  Therefore,  power  consumption  can  also  be  extracted  easily  from  the 
expanded  model. 
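
A minimal sketch of this kind of power estimate is given below, assuming the common first-order
approximation that the switching power of a net is activity × C × Vdd² × f. The connection list, the
capacitance values, and the activity factors are hypothetical; they stand in for whatever would be
extracted from the top level topology file.

```python
def dynamic_power_watts(capacitance_farads, vdd_volts, frequency_hz, activity=1.0):
    """Switching power of one net: activity * C * Vdd^2 * f (first-order approximation)."""
    return activity * capacitance_farads * vdd_volts ** 2 * frequency_hz


# Hypothetical connections taken from a top level topology file:
# (name, number of wires, capacitance per wire in farads, activity factor).
CONNECTIONS = [
    ("internal_bus_a", 32, 2e-12, 0.5),
    ("output_pads", 40, 50e-12, 0.25),
]


def estimate_power(connections, vdd_volts, frequency_hz, module_watts=()):
    """Sum bus and pad switching power, plus any per-module figures noted as comments."""
    total = sum(module_watts)
    for _name, wires, cap, activity in connections:
        total += wires * dynamic_power_watts(cap, vdd_volts, frequency_hz, activity)
    return total
```

With these made-up numbers, a 5 V supply, and a 10 MHz clock, the pads account for most of the
estimate, which matches the qualitative point made above.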

5.5.3.3. Using the Expanded Model for Complexity Estimation

In Chapter 4, I have shown that the complexity of a microarchitecture can be measured in
terms of the number of cycles of diagnostics and the human effort it takes to verify the
microarchitecture. The expanded model is tested by test programs. The size of these test
programs and the human effort involved in preparing and running these tests are therefore good
measures of the microarchitecture’s complexity. These, however, only measure one aspect of
complexity, which I call the functional complexity. Functional complexity measures the degree of
difficulty in analysis, design, and testing of the microprocessor.

Implementation  complexity  is  another  aspect  of  complexity.  Implementation  complexity 
measures  the  degree  of  difficulty  in  implementing  the  microprocessor.  The  total  number  of 
micro-modules  in  the  expanded  model,  the  size  of  the  micro-modules’  description,  the  number  of 
low  level  control  signals  that  control  the  micro-modules,  and  the  number  of  high  level  control 
signals generated by the master control are all important metrics for the implementation
complexity.


5.6.  The  Floor  Planning  Stage  of  Microarchitectural  Design 

As  the  name  implies,  the  goal  of  the  floor  planning  stage  is  to  produce  a  floor  plan  for  the 
proposed  microarchitecture.  There  are  two  important  considerations  in  designing  the  floor  plan: 

•  The  interaction  between  the  macro-modules. 

•  The  relative  dimensions  of  the  macro-modules. 

The macro-modules are the products of the abstract stage. The interaction between the
macro-modules is also studied during the abstract stage. The relative dimensions of the
macro-modules, however, depend on how they are expanded into micro-modules during the expansion
stage. The floor planning stage of microarchitectural design can therefore be considered as the
stage that summarizes the results from the abstract and expansion stages.

The interaction between the macro-modules and their relative dimensions can both be
summarized in a graph. This is done in Figure 5-6-1 for the SPUR CPU as an example. Notice that
Figure 5-6-1 is similar to Figure 5-4-7 except that I have added the relative dimensions
(height × width) for each macro-module. The goal here is to place the macro-modules that have a
large number of connections between them close to each other while at the same time trying to
keep the overall shape as rectangular as possible.

Figure 5-6-1 Important Information for Floor Planning

Each node in this graph represents a macro-module. The number within each node is the
macro-module’s relative dimensions (height × width). Each arc in this graph represents a set of
signals that must be sent from one macro-module to the others. The number associated with each
arc is the number of signals in the set.

For example, based on the dimensions shown in Figure 5-6-1, we decided to place the
Instruction Unit on top of the Operand Supplier because they are the biggest and longest. On the
other hand, based on the interactions shown in Figure 5-6-1, we decided to place the PC Logic
adjacent to the Instruction Unit and the Functional Unit adjacent to the Operand Supplier. After
careful consideration of the rest of Figure 5-6-1, we selected the floor plan shown in Figure 5-6-2
for the SPUR CPU.

Figure 5-6-2 The SPUR CPU Floor Plan

This  is  the  floor  plan  of  the  SPUR  CPU.  The  relative  dimensions  of  each  macro-module  are 
shown  in  parentheses.  Relative  dimensions  are  used  instead  of  absolute  dimensions  such  that  a 
tentative  floor  plan  can  be  produced  even  before  the  exact  technology  is  known. 
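
One simple way to compare candidate floor plans against a graph like Figure 5-6-1 is to weight
the distance between connected macro-modules by the number of signals on each arc. The Python
sketch below does this with Manhattan distance on a coarse grid. The cost function is an
illustrative assumption rather than the procedure used for the SPUR CPU, and the signal counts and
grid positions are hypothetical; only the macro-module names come from the text.

```python
def wiring_cost(placement, arcs):
    """Sum over arcs of (number of signals) x (Manhattan distance between modules)."""
    cost = 0
    for (a, b), signals in arcs.items():
        (xa, ya), (xb, yb) = placement[a], placement[b]
        cost += signals * (abs(xa - xb) + abs(ya - yb))
    return cost


# Hypothetical signal counts; Figure 5-6-1 holds the real ones.
ARCS = {
    ("InstructionUnit", "PCLogic"): 60,
    ("OperandSupplier", "FunctionalUnit"): 96,
    ("InstructionUnit", "FunctionalUnit"): 8,
}

# Candidate A: heavily connected modules sit next to each other.
adjacent = {
    "InstructionUnit": (0, 1), "PCLogic": (1, 1),
    "OperandSupplier": (0, 0), "FunctionalUnit": (1, 0),
}

# Candidate B: the PC Logic and the Functional Unit trade places.
swapped = dict(adjacent, PCLogic=(1, 0), FunctionalUnit=(1, 1))
```

Here wiring_cost(adjacent, ARCS) is lower than wiring_cost(swapped, ARCS), so candidate A would be
preferred. A fuller estimate would also take the relative dimensions of each macro-module into
account.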


5.7.  Conclusion 

In this chapter, I first defined the term microarchitecture and the phrase "microarchitectural
design." Based on my experiences in SPUR, I believe that the important issues concerning
microarchitectural design are: (1) off-chip communication, (2) pipeline and clocking, (3)
micro-modules selection, (4) resources allocation, (5) on-chip interaction, and (6) floor planning.
Off-chip communication, and pipeline and clocking are behavioral issues. Micro-modules selection,
on-chip interaction, and resources allocation are structural issues. Finally, floor planning is a
physical issue.

A systematic approach to microarchitectural design must begin with a systematic approach
to these microarchitectural issues. More specifically, a systematic approach to microarchitectural
design consists of three stages:

(1)  The  Abstract  Stage 

Construct the abstract model (Figure 5-3-2) and then use it to conduct studies on
microarchitectural issues: (a) off-chip communication, and (b) pipeline and clocking.

(2)  The  Expansion  Stage 

Construct the expanded model (Figure 5-3-3) by expanding the macro-modules and the
controller in the abstract model. The microarchitectural issues to be studied here are: (a)
micro-modules selection, (b) resources allocation, and (c) on-chip interaction.

(3)  The  Floor  Planning  Stage 

Based on the information we learned from the abstract and expanded models, construct a
detailed floor plan for the microarchitecture.

These three stages are illustrated in Section 5.4, Section 5.5, and Section 5.6 using the
SPUR CPU as an example.

Chapter 6

SUMMARY  AND  FUTURE  TRENDS 

I  never  think  of  the  future.  It  comes  soon  enough. 

Albert  Einstein,  1930 


This chapter first summarizes this thesis and the SPUR project in Section 6.1. In Section
6.2, I discuss what I think will be the future trends based on the lessons I learned.

6.1.  Summary 

Section 6.1.1 summarizes this thesis. Section 6.1.2 reviews the history of the SPUR project
from the SPUR CPU’s perspective. Section 6.1.3 discusses the organization of the SPUR project.

6.1.1.  Thesis  Summary 

In Chapter 1 and Chapter 2, I gave a brief history of VLSI projects at U.C. Berkeley, an
overview of the SPUR project, and an overview of the SPUR CPU microarchitecture. One of the
most important differences between the SPUR CPU and the previous two generations of Berkeley
RISC projects is that the goal of the SPUR project is not just to build a CPU. The SPUR project’s
goal is to build a system in which the SPUR CPU is just one of the three custom VLSI chips.

In Chapter 3, I first talked about how we used proven ideas from the two previous
generations of Berkeley RISC machines to design the SPUR CPU microarchitecture. Just because we
used proven ideas does not mean our job of building a chip for a system was easy. We still had to
deal with many problems that are not as important when one just wants to build a CPU. Two
specific examples are:

(1) We must implement some system features that require a lot of work, and

(2) We must do many extra simulations.

One important lesson I learned in dealing with all these problems is that we must make the
process of designing a microprocessor more like a science than an art.

In Chapter 4, I stated that the designer can make the process of designing a microprocessor
more like a science than an art by putting more emphasis on quantitative evaluation of the
performance, resources, and complexity tradeoffs. I also showed a simple performance model and then
performed tradeoff evaluations for LISP support, floating point support, the 4-stage pipeline, the
on-chip instruction cache, and multiprocessing support. One major conclusion from Chapter 4 is that
the designer must keep the cycle time and the average number of cycles per instruction as low as
possible.

Finally, in Chapter 5, I introduced a systematic approach to microarchitectural design that
consists of three stages: (1) the abstract stage, (2) the expansion stage, and (3) the floor planning
stage. During the abstract stage, we build the abstract model of the microarchitecture to study the
off-chip communication, and pipeline and clocking issues. During the expansion stage, the
abstract model is expanded. The major issues to be studied during the expansion stage are
micro-modules selection, resources allocation, and on-chip interaction. Finally, during the floor
planning stage, we apply what we learned from the abstract and expansion stages to design a floor
plan for the microarchitecture.

6.1.2.  The  History  of  the  SPUR  Project 

The SPUR project is probably one of the most ambitious computer projects ever accomplished
in the university environment. The history of the SPUR project is described below with
respect to the SPUR CPU development.

Spring 1985

The SPUR CPU basic microarchitecture was conceived in the CS292i class taught by
Professor Randy Katz [Kat85].

Summer  1985 

The CPU’s external interfaces to the cache controller (CC), the coprocessor (FPU), and the
processor board were defined.

Fall  1985 

We  began  the  datapath  layout  and  started  writing  the  behavioral  description  that  models  the 
microarchitecture. 

Spring  1986 

The  layout  of  the  datapath  was  completed. 

Summer  1986 

We began the layout of the control unit.

Fall 1986

The layout of the instruction unit was completed. At the same time, we also completed the
behavioral  description. 

Spring  1987 

Logic simulation and timing analysis of individual modules were carried out.

Summer  1987 

Global  simulation  was  completed.  The  SPUR  CPU  was  submitted  for  fabrication  on  August 
25.  It  came  back  in  December  1987. 

Spring  1988 

We tested and debugged the SPUR CPU. The second version of the CPU was submitted for
fabrication in April. It came back in June 1988.

Summer 1988

We tested and debugged the SPUR processor board. The Sprite [OCD88] operating system
was running on SPUR hardware by the end of the summer.

Fall  1988 

We concentrated our effort on multiprocessor testing and debugging.

Spring  1989 

On January 9, 1989, a SPUR multiprocessor running the Sprite operating system and LISP
software was presented at U.C. Berkeley.

This ambitious four-year project involved a large group of professors and graduate students.
Section 6.1.3 shows the organization of the SPUR project.

6.1.3.  The  Organization  of  the  SPUR  Project 

Professor Dave Patterson was the principal investigator of the SPUR project. The SPUR
project  was  organized  into  three  groups:  (1)  the  hardware  group,  (2)  the  operating  system  group, 
and  (3)  the  programming  language  group. 

The  Hardware  Group.  Professor  Randy  Katz  and  Professor  David  Hodges  were  in  charge 
of  the  hardware  group  with  Professor  Randy  Katz  on  architectural  design  and  Professor  David 
Hodges on circuit design. The hardware group was further divided into four groups:

The  CPU  Group 

Mark Hill and George Taylor were responsible for the macroarchitecture of the CPU. I was
responsible for the microarchitecture. Dave Lee, with the help of Rich Duncombe (initial
implementation of the Instruction Unit) and Wook Koh (initial implementation of the Upper
Datapath), was responsible for the circuit design and layout.

The  Cache  Controller  Group 

David Wood, Garth Gibson, and Susan Eggers were responsible for the macroarchitecture
and microarchitecture of the Cache Controller. D.K. Jeong was responsible for the circuit
design and layout.

The Floating Point Unit Group

B.K. Bose, Paul Hansen, and Corina Lee were responsible for all aspects of the Floating
Point Unit design.

The  Processor  Board  Group 

Our staff engineer Ken Lutz, with the help of Kathy Armstrong, was responsible for the
design and implementation of the SPUR processor board.

The hardware group also got very valuable help from Joan Pendleton during the initial stage of the
SPUR project and Doug Johnson, an engineer from Texas Instruments, during the final stage of
the project. The U.C. Berkeley CAD research community also gave us constant support.

The Operating System Group. Professor John Ousterhout was in charge of developing
the operating system Sprite for the SPUR multiprocessor. The graduate students who worked in
the operating system group were Michael Nelson, Brent Welch, Fred Douglis, Andrew
Cherenson, and Mendel Rosenblum.

The Programming Language Group. Professor Paul Hilfinger was in charge of developing
the LISP system [Tay86] [ZHH88] for the SPUR multiprocessor. The graduate students who
worked in the programming language group were Jim Larus and Ben Zorn.

6.2.  Future  Trends 

In  this  section,  I  want  to  say  a  few  words  about  what  I  think  the  future  trends  are  in  the 
architectural,  technology,  and  CAD  support  areas.  In  my  performance  evaluation,  I  have  shown 
quantitatively  in  Section  4.7.2  that  a  simple  architecture  can  surpass  the  performance  of  a  more 
complex  architecture  by  keeping  the  average  number  of  cycles  per  instruction  and  the  cycle  time 
low.  Therefore,  I  think  the  architectural  trend  is  to  reduce  the  average  number  of  cycles  per 
instruction  and  the  technology  trend  is  to  lower  cycle  time.  Both  of  these  can  be  accomplished 
more easily if CAD support is readily available.

6.2.1. Architectural Trends

The architectural trend is to reduce the average number of cycles per instruction. The
average number of cycles can be reduced by the following methods:

(1) Improve the hit rate of the on-chip instruction cache. Mark Hill [Hil87] has studied better
instruction cache design and implementation ideas.

(2) When more on-chip transistors are available, I think an on-chip data cache is more desirable
than complex functions. In Chapter 4, I have shown that it is more cost effective to
support complex functions via a coprocessor.

(3)  Reduce  the  branch  penalty.  Branch  folding  in  CRISP  [BDM87]  is  one  example  where  in 
the  best  scenario,  a  branch  is  executed  implicitly  with  other  instructions.  This  reduces 
branch  penalty  to  zero. 

(4) Finally, one may try to put multiple functional units on chip such that multiple
instructions can be executed per cycle.

The first two methods, even in the best scenario, can only reduce the average number of
cycles per instruction to one. The third and fourth methods, on the other hand, can reduce the
average number of cycles per instruction to less than one.
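
To make the distinction between the four methods concrete, here is a first-order model of the
average number of cycles per instruction, written in Python. The model itself and every parameter
value are illustrative assumptions; it is not the performance model of Chapter 4.

```python
def average_cpi(icache_miss_rate, miss_penalty, branch_fraction, branch_penalty,
                folded_fraction=0.0, issue_width=1):
    """First-order CPI estimate; every parameter is an illustrative assumption.

    folded_fraction is the fraction of all instructions that are branches folded
    into another instruction, so they take no issue slot of their own.
    """
    # Instructions that still need an issue slot share issue_width slots per cycle.
    base = (1.0 - folded_fraction) / issue_width
    # Stall cycles added by instruction cache misses and by unfolded branches.
    stalls = icache_miss_rate * miss_penalty + branch_fraction * branch_penalty
    return base + stalls
```

With no stalls, no folding, and a single functional unit the model bottoms out at one cycle per
instruction, which is the limit of the first two methods. Setting folded_fraction above zero or
issue_width above one is what takes the result below one.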

6.2.2.  Technology  Trends 

The  technology  trend  is  to  reduce  the  cycle  time.  The  cycle  time  can  be  reduced  by: 

(1) Scaling down the CMOS technology. Common belief is that a device width of 0.25μm is
the practical limit and it will be reached in the 1990s [MiYH86].

(2) ECL is faster than CMOS but it also has lower density and uses more power.

(3) BiCMOS takes advantage of both bipolar and CMOS circuits on the same chip and
looks very promising.

(4) GaAs is very fast and uses less power than ECL, but it is very expensive in terms of wafer
cost, yield, and density.

One  thing  I  want  to  point  out  is  that  although  ECL  logic  gates  use  more  power  than  GaAs 
and  CMOS  logic  gates,  this  may  become  less  important  in  the  future.  I  believe  in  the  future  most 
of  the  power  will  not  be  consumed  in  the  logic  gates.  Most  of  the  power  will  be  consumed  in 
driving  the  internal  busses  and  off-chip  pads.  For  example  in  the  SPUR  CPU,  we  estimated  that 
60%  of  the  total  power  consumption  is  spent  in  driving  the  off-chip  pads  and  20%  is  spent  in 
driving the on-chip busses already. These numbers are going to get worse when the feature size
gets smaller and when the CPU runs at a higher clock rate because:

• As feature sizes get smaller, the capacitance of the on-chip busses gets relatively
bigger, so more power is needed to drive them at high speed.

• As the CPU runs faster, the off-chip communication channel must also run faster, so more
power is needed to drive the off-chip pads.

GaAs uses field effect transistors that are similar to the MOS transistors used in CMOS.
Although Seymour Cray suggested that GaAs discrete components have low capacitance [Cra88],
this unfortunately may not apply to VLSI applications. In VLSI, the capacitance of the multi-level
interconnect network is the dominant factor. This multi-layer interconnect capacitance depends
on the materials that separate the interconnect layers and does not depend on the substrate
material. Therefore GaAs is likely to have similar power consumption problems as CMOS. ECL,
on the other hand, uses bipolar transistors that have much bigger current driving capability and
smaller voltage swings than field effect transistors. Consequently, ECL will have fewer problems
driving highly capacitive internal busses and off-chip pads. Finally, one should realize that a fast
CPU must run in a fast environment. The power consumption of this fast environment can make
the power consumption of the CPU negligible.

6.2.3. CAD Support Trends

I believe CAD researchers will continue their emphasis on building automatic layout
generators, that is, silicon compilers. Although silicon compilers are useful, I do not think they
will solve all the problems because the designer still has to define the microarchitecture as input
to the silicon compiler. My experience in SPUR has convinced me of the following:

• Defining a microarchitecture is not a simple task. Therefore, we need more high level
design tools that can help the designer make quantitative tradeoff decisions.

• Moving to the future means going back to the basics! The Mead and Conway design style is
not enough. Building a high performance VLSI chip is still an electrical engineer’s job. We
need electrical rules checkers that understand resistance, capacitance, and inductance.

•  I  think  VLSI  designers  can  do  CAD  tool  designers  a  big  favor  by  building  simple  tools.  No 
matter  how  simple  the  tool  is,  I  think  it  is  still  the  best  way  to  define  the  problem  formally. 

As I stated in Section 3.5, I do not believe CAD tools are there to replace the VLSI designer.
However, I do believe the future of VLSI design depends strongly on CAD support. One
contrasting view is from Nick Tredennick [Tre87]:

This  book  is  partly  in  response  to  the  growing  presumption  that  computers  are  an  essential  part  of 
logic  design.  They  are  not  ...  I  think  of  computers  as  an  expensive  and  awkward  alternative  to 
pencil  and  paper. 

Nick  Tredennick,  Microprocessor  Logic  Design,  Page  4 

This may be true for a talented VLSI artist such as Nick. However, for an average VLSI engineer
like myself, CAD tools are essential. As a matter of fact, my colleagues in SPUR and I have
promised ourselves not to design another VLSI chip unless we have more CAD support for all
aspects of the design. I think the introduction of CAD support to VLSI design is analogous to the
introduction of the jet engine to aviation at the end of World War II.

CAD support enables the VLSI designer to move much faster, but it also takes some of the art out of VLSI design. Similarly, a jet engine enables an aircraft to fly faster, but it also takes some of the art out of aviation. In many ways, a jet aircraft is easier to fly than a propeller-driven aircraft because a spinning propeller creates many mysterious effects that are handled by pilots more like an art than a science. Furthermore, the jet engine also enables aircraft to fly much higher to avoid most bad weather. Consequently, when the jet engine was first introduced, many "real aviators" insisted that flying propeller-driven aircraft was still the only true art of aviation. They may be right. After all, how can you argue with an artist? However, most people probably prefer getting to their destination in one hour, in a jet aircraft piloted by just an average pilot, to getting there in four hours in a propeller-driven aircraft piloted by a "real aviator."

Appendix A

DETAILED  DESCRIPTION  OF  THE 
SPUR  CPU  MICROARCHITECTURE 

Architecture  is  the  art  of  how  to  waste  space. 

Philip  Johnson,  1964 


A.1. The SPUR CPU Block Diagram

Figure A-1-1 is the detailed SPUR CPU block diagram. This block diagram shows the relative position of each block in the layout. The following naming conventions are used in this block diagram:

• Register names start with an upper case letter and the rest are lower case except to improve readability. Examples: Dst1 and IfetPC.

• Functional block names are in upper case letters only. Examples: ALU and EXT_INS.

• Signal names start with a lower case letter and the rest are lower case except to improve readability. Examples: busA and trapType.

The CPU can be divided into two units: (1) the Instruction Unit (I-Unit) at the upper left corner and (2) the Execution Unit (E-Unit), which occupies the rest of the area. The Instruction Unit (see Section 2.2) is a 512-byte direct-mapped instruction cache. The Execution Unit (see Section 2.3) consists of four parts:

(1) The lower datapath performs all the register-to-register operations.

(2) The upper datapath performs all the program control operations.

(3) The cache controller interface communicates with the cache controller chip.

(4) The control unit controls the Execution Unit.


Figure A-1-1 The SPUR CPU Block Diagram

The datapath is split into two parts (see Section 2.3): the upper datapath at the top and the lower datapath at the bottom. The CONTROL UNIT and the CACHE CONTROLLER INTERFACE are in the middle. The upper and lower datapaths are connected by busS. There are three major sets of I/O pads: the DATA PADS on the left, the ADDRESS PADS on the right, and the INSTRUCTION PADS on the top. The INSTRUCTION PADS are part of the coprocessor interface. The coprocessor (FPU) must monitor these pads continuously to detect any instruction it has to execute.

The lower datapath contains a 138-word register file, some temporary registers, and several functional units. It is 40 bits wide because 8 of the bits are used for tags. The upper datapath contains some special registers and the program counter logic. It is 30 bits wide because all instructions are word addressed. The CPU chip resides inside a 208-pad pad frame. The CPU only needs about 180 pads, but the same pad frame is used by all three SPUR custom chips.


Figure  A-2-1  The  SPUR  CPU  Registers  Set 


Each register window has ten local registers, six input registers, and six output registers. The input and output registers of adjacent windows overlap and are used for parameter passing. Special register numbers are used by the RD_SPECIAL and WR_SPECIAL instructions to specify the source (Ss1) and destination (Sd) special registers, respectively. Kpsw and Ins do not have special register numbers because they are not manipulated by the RD_SPECIAL or WR_SPECIAL instructions.

The SPUR CPU register set (Figure A-2-1) consists of 138 general purpose registers and seven special registers. The 138 general purpose registers are organized into 10 global registers and eight overlapping register windows. The seven special registers are:

Cwp<4:2>      Current register window pointer. Points to the register window that is currently in use.

Swp<31:3>     Save register window pointer. Points to the memory location where the last overflowed register window (pointed to by Swp<9:7>) is saved.

Kpsw<31:2>    Kernel processor status word.

Upsw<31:2>    User processor status word.

Ins<1:0>      Insert byte count register.

ExecPC<31:2>  Program counter. Contains the address of the instruction currently being executed. This is a read-only register.

FpuPC<31:2>   Program counter. Contains the address of the last floating point instruction sent to the FPU coprocessor. This is a read-only register.

Whenever a CALL instruction is executed, Cwp is incremented by one and a new window is opened. Conversely, whenever a RETURN instruction is executed, Cwp is decremented by one and the window is closed. When window overflow occurs, register windows are saved to memory. Swp contains the memory address at which the last register window is saved. The status of the SPUR CPU is stored in the two processor status words: Kpsw and Upsw. Both the Kpsw and the Upsw are 30 bits wide. However, as shown in Figure A-2-2, only a small number of these 60 bits are used by the hardware. The rest of the bits are used by software to store relevant process information. The Ins register is used by the INSERT instruction to decide where within a word the byte should be inserted. The special registers ExecPC and FpuPC are read only; writing to them has the same effect as a NOOP.


Kernel Processor Status Word - Kpsw:
Fields: AllEn, ErrorEn, FaultEn, PreMode, User, IfetVir, DataVir, IuEn, IuPre.
Bit meanings: Enable All Traps, Enable Error, Enable Fault, Enable Interrupt, Previous Mode Before Trap, User Mode, Virtual Mode for I. Fetch, Virtual Mode for Data Access, Enable I-Unit, Enable Prefetch.

User Processor Status Word - Upsw:
Fields: FPU Par, FPU ExEn, FPU En, TagTr En, GenTr En, InOv En.
Bit meanings: FPU Enable Parallel Mode, FPU Enable Exception, Enable FPU, Enable Tag Trap, Enable Generation Trap, Enable Integer Overflow Trap.

Figure  A-2-2  Upsw  and  Kpsw  Bit  Assignments 


Each  bit’s  definition  specified  in  this  figure  is  the  meaning  of  the  bit  when  it  is  equal  to  1.  As 
shown  in  Figure  A-2-1,  Upsw<31:2>  and  Kpsw<31:2>  are  30  bits  wide.  This  figure  only  shows 
the  bits  that  are  used  by  the  hardware.  Bits  that  are  not  used  by  the  hardware  can  be  read  and 
written  by  software. 
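
The register organization described in this section can be checked with a small model. The following Python sketch is only an illustration: it assumes that each window contributes its ten local registers plus six registers shared with its neighbor, that Cwp wraps modulo the eight windows, and that overflow is detected with the Table A-3-8 condition Cwp+1 = Swp<9:7>. The names `physical_register_count`, `RegisterWindows`, `call`, and `ret` are hypothetical and do not come from the SPUR design.

```python
# Illustrative model of the SPUR register windows (not the actual hardware).
# Assumption: window numbers wrap modulo NUM_WINDOWS.
NUM_GLOBALS = 10
NUM_WINDOWS = 8
LOCALS_PER_WINDOW = 10
OVERLAP_PER_WINDOW = 6   # outputs of one window are the inputs of the next


def physical_register_count() -> int:
    """Each window adds its locals plus one set of shared registers."""
    return NUM_GLOBALS + NUM_WINDOWS * (LOCALS_PER_WINDOW + OVERLAP_PER_WINDOW)


class RegisterWindows:
    """Tracks Cwp and the window number held in Swp<9:7> (hypothetical helper)."""

    def __init__(self, cwp: int = 0, swp_window: int = 0) -> None:
        self.cwp = cwp % NUM_WINDOWS
        self.swp_window = swp_window % NUM_WINDOWS

    def call(self) -> bool:
        """Open a new window. Returns False (window overflow) when
        Cwp+1 equals Swp<9:7>; in that case Cwp is left unchanged."""
        nxt = (self.cwp + 1) % NUM_WINDOWS
        if nxt == self.swp_window:
            return False
        self.cwp = nxt
        return True

    def ret(self) -> None:
        """Close the current window (the underflow check is not modeled)."""
        self.cwp = (self.cwp - 1) % NUM_WINDOWS
```

With these assumptions, `physical_register_count()` reproduces the 138 general purpose registers quoted above (10 + 8 x 16), and a fresh `RegisterWindows()` accepts seven nested calls before the eighth reports an overflow.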


A.3.  The  SPUR  CPU  Instruction  Set 

The SPUR CPU instruction set [Tay85] can be divided into seven types of instructions: (1) Load, (2) Register-Register, (3) Jump-Register and Return, (4) Read and Write Special Registers, (5) Store, (6) Compare-Branch, and (7) Call-Jump. These instructions are summarized in Tables A-3-1 through A-3-7. Floating point instructions are not discussed in this section and can be found in [Bos88]. As mentioned in Section 2.4.1, unusual conditions can arise during instruction execution and may cause a trap. These unusual conditions are listed in Table A-3-8. Finally, Table A-3-9 shows the branch conditions for all the Compare-Branch instructions.

Load, Jump-Register and Return, and Read and Write Special Registers instructions all have the same format as Register-Register instructions. The Ri field of Register-Register instructions can either be a register specified by the Rs2 field or the sign extension of the 14-bit immediate field (see Figure 2-1-1). In order to support the Berkeley Ownership cache consistency protocol [KEW85], most Load instructions have two flavors: simple read or read for ownership (opcode_RO). LD_32 has one more flavor, the RI flavor. LD_32_RI tells the Cache Controller to ignore any page fault it may cause and provide the data to the CPU anyway. LD_32 is also the only Load that can access data in physical mode. Therefore, LD_32’s Cache Operations can be either RD32 (virtual mode) or PR32 (physical mode). TEST_&_SET and LD_EXTERNAL are similar to LD_32 as far as the CPU is concerned. The only difference is their Cache Operations, which are handled differently by the Cache Controller. A complete explanation of all the Cache Operations is given in [WEG87].


Load Instructions

Instruction    Operands       Cache         Action                           Unusual
                              Operations                                     Conditions
                                                                             (Table A-3-8)

LD_40          Rd, Rs1, Ri    RD64          Rd<39:0> <- Mem[Rs1+Ri]          None
LD_40_RO                      RFO64

CXR            Rd, Rs1, Ri    RD64          Rd<39:0> <- Mem[Rs1+Ri]          D
CXR_RO                        RFO64         LISP pointer check

LD_32          Rd, Rs1, Ri    RD32, PR32    Rd<31:0> <- Mem[Rs1+Ri]          None
LD_32_RO                      RO32          Rd<39:32> <- 0x00
LD_32_RI                      RA32

TEST_&_SET     Rd, Rs1, Ri    TS32          Rd<31:0> <- Mem[Rs1+Ri]          None
                                            Rd<39:32> <- 0x00

LD_EXTERNAL    Rd, Rs1, Ri    RD_CACHE      Rd<31:0> <- CC Reg[Rs1+Ri]       E
                                            Rd<39:32> <- 0x00

Table A-3-1 Load Instructions
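
The choice of the Ri operand can be written out explicitly. The Python sketch below is illustrative: it assumes, following the Exec stage of Table A-4-1, that bit 14 of the instruction word selects between the Rs2 register and the immediate in bits 13:0, and it takes the Rs2 register value as an argument because the field positions of Figure 2-1-1 are not reproduced here. The function names `sign_extend` and `select_ri` are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def select_ri(instruction: int, rs2_value: int) -> int:
    """Return the Ri operand: the Rs2 register value when instruction bit 14
    is 0, otherwise the sign-extended 14-bit immediate in bits 13:0."""
    if ((instruction >> 14) & 1) == 0:
        return rs2_value
    return sign_extend(instruction & 0x3FFF, 14)
```

For example, an instruction word with bit 14 set and all ones in bits 13:0 yields an Ri of -1 under this model, while the same word with bit 14 clear simply passes the Rs2 value through.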

The Store instructions do not have the same format as the Register-Register instructions. Their immediate field (Imm) is the sign extension of the 14-bit immediate field formed by concatenating the High Imm and Low Imm fields of the instruction (see Figure 2-1-1). Since ST_32 is the only instruction that can perform a store in physical mode, ST_32’s Cache Operations can be either WR32 (virtual mode) or PW32 (physical mode).


Register-Register Instructions

Instruction    Operands       Action                                     Unusual
                                                                         Conditions
                                                                         (Table A-3-8)

ADD_NT         Rd, Rs1, Ri    Rd<31:0> <- Rs1 + Ri                       None
                              Rd<39:32> <- Rs1<39:32>

ADD            Rd, Rs1, Ri    Rd<31:0> <- Rs1 + Ri                       F&I
                              Rd<39:32> <- Rs1<39:32>

SUB            Rd, Rs1, Ri    Rd<31:0> <- Rs1 - Ri                       F&I
                              Rd<39:32> <- Rs1<39:32>

AND            Rd, Rs1, Ri    Rd<31:0> <- Rs1 and Ri                     F
                              Rd<39:32> <- Rs1<39:32>

OR             Rd, Rs1, Ri    Rd<31:0> <- Rs1 or Ri                      F
                              Rd<39:32> <- Rs1<39:32>

XOR            Rd, Rs1, Ri    Rd<31:0> <- Rs1 xor Ri                     F
                              Rd<39:32> <- Rs1<39:32>

SLL            Rd, Rs1, Ri    Rd<31:0> <- Rs1<31:0> shift                F
                              left by Ri<1:0> bits
                              Rd<39:32> <- Rs1<39:32>

SRA            Rd, Rs1, Ri    Rd<31:0> <- Rs1<31:0> arithmetic           F
                              shift right by Ri<0> bit
                              Rd<39:32> <- Rs1<39:32>

SRL            Rd, Rs1, Ri    Rd<31:0> <- Rs1<31:0> logic                F
                              shift right by Ri<0> bit
                              Rd<39:32> <- Rs1<39:32>

RD_TAG         Rd, Rs1        Rd<31:8> <- 0                              None
                              Rd<7:0> <- Rs1<39:32>

EXTRACT        Rd, Rs1, Ri    Rd<31:8> <- 0                              None
                              Rd<7:0> <- Rs1[byte Ri<1:0>]

WR_TAG         Rd, Rs1, Ri    Rd<31:0> <- Rs1<31:0>                      None
                              Rd<39:32> <- Ri<7:0>

INSERT         Rd, Rs1, Ri    Rd[byte Ins<1:0>] <- Ri<7:0>               None

Table A-3-2 Register-Register Instructions
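
The tag-manipulation instructions of Table A-3-2 can be illustrated on a 40-bit register value. The Python sketch below is a minimal model, assuming that byte 0 is the least significant byte of the 32-bit data word and that INSERT copies the remaining data bytes from Rs1 (neither detail is stated in the table). Except for `wr_tag`, the functions return only the 32-bit data part of the result. The function names are hypothetical.

```python
DATA_MASK = 0xFFFF_FFFF          # bits 31:0, the data word
TAG_SHIFT = 32                   # bits 39:32 hold the 8-bit tag


def rd_tag(rs1: int) -> int:
    """RD_TAG: Rd<7:0> <- Rs1<39:32>, Rd<31:8> <- 0."""
    return (rs1 >> TAG_SHIFT) & 0xFF


def wr_tag(rs1: int, ri: int) -> int:
    """WR_TAG: Rd<31:0> <- Rs1<31:0>, Rd<39:32> <- Ri<7:0>."""
    return ((ri & 0xFF) << TAG_SHIFT) | (rs1 & DATA_MASK)


def extract(rs1: int, ri: int) -> int:
    """EXTRACT: Rd<7:0> <- byte Ri<1:0> of Rs1, Rd<31:8> <- 0.
    Assumption: byte 0 is the least significant byte."""
    return (rs1 >> (8 * (ri & 0x3))) & 0xFF


def insert(rs1: int, ri: int, ins: int) -> int:
    """INSERT: Rd[byte Ins<1:0>] <- Ri<7:0>.
    Assumption: the remaining data bytes are copied from Rs1."""
    shift = 8 * (ins & 0x3)
    data = rs1 & DATA_MASK
    cleared = data & ~(0xFF << shift) & DATA_MASK
    return cleared | ((ri & 0xFF) << shift)
```

Under these assumptions, writing tag 0xAB onto the data word 0x12345678 gives the 40-bit value 0xAB12345678, from which `rd_tag` recovers 0xAB and `extract` with Ri = 3 recovers the byte 0x12.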

Compare-Trap is a special case of Compare-Branch in which the branch is taken as a trap. The third operand of Compare-Branch or Compare-Trap can be Rc, Rs2, or Tag Imm, depending on the Cond field. The Rc option means the operand can be either a register specified by the Rs2 field or the zero extension of the 5-bit Short Immediate field (see Figure 2-1-1).


Jump-Register and Return Instructions

Instruction    Operands    Action                                       Unusual
                                                                        Conditions
                                                                        (Table A-3-8)

JUMP_REG       Rs1, Ri     PC <- Rs1 + Ri                               None

RETURN         Rs1, Ri     PC <- Rs1 + Ri                               B
                           Pop to previous window; Cwp <- Cwp - 1

RETURN_TRAP    Rs1, Ri     PC <- Rs1 + Ri                               B
                           Pop to previous window; Cwp <- Cwp - 1
                           Enable all traps; Kpsw<AllEn> <- 1

Table A-3-3 Jump-Register and Return Instructions


Read and Write Special Registers Instructions

Instruction    Operands       Action                           Unusual
                                                               Conditions
                                                               (Table A-3-8)

RD_SPECIAL     Rd, Ss1        Rd<31:0> <- Ss1                  None
                              Rd<39:32> <- 0x00

RD_INSERT      Rd             Rd<31:0> <- Ins                  None
                              Rd<39:32> <- 0x00

RD_KPSW        Rd             Rd<31:0> <- Kpsw                 None
                              Rd<39:32> <- 0x00

WR_SPECIAL     Sd, Rs1, Ri    Sd <- Rs1 + Ri                   None

WR_INSERT      Ri             Ins <- Ri<1:0>                   None

WR_KPSW        Rs1, Ri        Kpsw <- Rs1 + Ri                 E

INVALID_IB                    Invalidate all entries in the    None
                              on-chip instruction cache

Table A-3-4 Read and Write Special Registers Instructions
Ss1 and Sd are specifiers that specify the special registers to be read or written.


Store Instructions

Instruction    Operands         Cache         Action
                                Operations

ST_40          Rs2, Rs1, Imm    WR64          Mem[Rs1+Imm] <- Rs2<39:0>
                                              Generation Check

ST_32          Rs2, Rs1, Imm    WR32, PW32    Mem[Rs1+Imm] <- Rs2<31:0>

ST_EXTERNAL    Rs2, Rs1, Imm    WR_CACHE      CCReg[Rs1+Imm] <- Rs2<31:0>

Table A-3-5 Store Instructions


Compare-Branch Instructions

Instruction    Cond                      Operands             Action

CMP_BRANCH     always, ge, ne,           Cond, Rs1,           if (Cond=TRUE)
               gt, never, lt,            Rc, Offset             PC <- PC + signExt(Offset)
               eq, le, uge,
               ugt, ult, ule

CMP_BRANCH     eq_tag, eq_38,            Cond, Rs1,           if (Cond=TRUE)
               ne_tag, ne_38             Rs2, Offset            PC <- PC + signExt(Offset)

CMP_BRANCH     eq_tag_imm,               Cond, Rs1,           if (Cond=TRUE)
               ne_tag_imm                Tag Imm, Offset        PC <- PC + signExt(Offset)

CMP_TRAP       always, ge, ne,           Cond, Rs1,           if (Cond=TRUE)
               gt, never, lt,            Rc, Offset             Take a trap!
               eq, le, uge,
               ugt, ult, ule

CMP_TRAP       eq_tag, eq_38,            Cond, Rs1,           if (Cond=TRUE)
               ne_tag, ne_38             Rs2, Offset            Take a trap!

CMP_TRAP       eq_tag_imm,               Cond, Rs1,           if (Cond=TRUE)
               ne_tag_imm                Tag Imm, Offset        Take a trap!

Table A-3-6 Compare-Branch Instructions


Call-Jump Instructions

Instruction    Operands        Action                                 Unusual
                                                                      Conditions
                                                                      (Table A-3-8)

JUMP           Word Address    PC <- PC<31:30> cat Word Address       None

CALL           Word Address    PC <- PC<31:30> cat Word Address       A
                               Open new window; Cwp <- Cwp + 1
                               Save PC; R10 (new window) <- PC

Table A-3-7 Call-Jump Instructions
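
The target computation in Table A-3-7 can be sketched as a bit-field concatenation. The Python function below is an illustration under a stated assumption: since the program counters are defined as bits <31:2>, the Word Address operand is taken to be a 28-bit word index that fills PC<29:2>, with PC<31:30> kept from the current PC. That width is inferred, not quoted from the instruction format, and the name `jump_target` is hypothetical.

```python
def jump_target(pc: int, word_address: int) -> int:
    """PC <- PC<31:30> cat Word Address.
    Assumption: Word Address is a 28-bit word index occupying PC<29:2>,
    so the two low-order byte-address bits of the result are zero."""
    upper = pc & 0xC000_0000                      # keep PC<31:30>
    lower = (word_address & 0x0FFF_FFFF) << 2     # word index -> byte address
    return upper | lower
```

For instance, with the current PC at 0x80001000 and a Word Address of 0x10, this model produces the target 0x80000040: the top two bits are preserved and the word index is scaled by four.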


Unusual Conditions: Definition and Condition

Window Overflow:
    Attempt to execute CALL when Cwp+1 = Swp<9:7>

Window Underflow:
    Attempt to execute RETURN or RETURN_TRAP when Cwp-1 = Sv

LISP Pointer Type Violation:
    Rs1<37:32> != CONS or NIL

Kernel Mode Violation:
    Attempt to execute a privileged instruction when Kpsw<UserBit> = 1

LISP Data Type Violation:
    Rs1<37:32> != FIXNUM or Rs2<39:32> != FIXNUM

LISP Data Type Violation:
    (Rs1<37:32> != FIXNUM or Rs2<39:32> != FIXNUM) and
    (Rs1<37:32> != CHARACTER or Rs2<39:32> != CHARACTER)

Generation Violation:
    Rs2<39:38> > Rs1<39:38>

Integer Overflow

Compare trap with valid condition

Table A-3-8 The SPUR CPU Unusual Conditions


Mnemonic     Binary (Hex.)    Branch Conditions              Notes

ALWAYS       00 000 (00)      Always Branch
GE           00 001 (01)      Rs1<31:0> >= Rc<31:0>          1, 2
NE           00 010 (02)      Rs1<31:0> != Rc<31:0>          1, 2
GT           00 011 (03)      Rs1<31:0> >  Rc<31:0>          1, 2
NEVER        00 100 (04)      Never Branch
LT           00 101 (05)      Rs1<31:0> <  Rc<31:0>          1, 2
EQ           00 110 (06)      Rs1<31:0> =  Rc<31:0>          1, 2
LE           00 111 (07)      Rs1<31:0> <= Rc<31:0>          1, 2
UGE          01 001 (09)      Rs1<31:0> >= Rc<31:0>          3
UGT          01 011 (0B)      Rs1<31:0> >  Rc<31:0>          3
ULT          01 101 (0D)      Rs1<31:0> <  Rc<31:0>          3
ULE          01 111 (0F)      Rs1<31:0> <= Rc<31:0>          3
FPU_TRUE     10 000 (10)      fpuBrT_F_C4 = 1                4
EQ_TAG       10 001 (11)      Rs1<37:32> =  Rs2<37:32>       5
EQ_38        10 011 (13)      Rs1<37:0>  =  Rs2<37:0>        5
FPU_FALSE    10 100 (14)      fpuBrT_F_C4 = 0                4
NE_TAG       10 101 (15)      Rs1<37:32> != Rs2<37:32>       5
NE_38        10 111 (17)      Rs1<37:0>  != Rs2<37:0>        5
EQ_TC        11 001 (19)      Rs1<37:32> =  Tag_Imm
NE_TC        11 101 (1D)      Rs1<37:32> != Tag_Imm

Table A-3-9 The SPUR CPU Branch Conditions

Notes: 

1. If busI<14> = 0, Rc = Rs2. Otherwise, Rc = Zero Ext (Short Imm).

2. Rs1 and Rc are treated as 2’s complement signed integers.

3. Rs1 and Rc are treated as unsigned integers.

4. fpuBrT_F_C4 is an external input coming from the FPU.

5. Only the type tags are checked. Generation numbers are ignored.
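
The branch conditions of Table A-3-9 and Notes 2, 3, and 5 can be expressed as a small evaluator. The Python sketch below is illustrative only: it covers the integer and tag conditions, leaves out FPU_TRUE and FPU_FALSE because they depend on an external signal, and assumes the caller has already selected the second operand (Rc, Rs2, or Tag_Imm). The names `to_signed32`, `type_tag`, and `branch_taken` are hypothetical.

```python
MASK32 = 0xFFFF_FFFF
MASK38 = (1 << 38) - 1


def to_signed32(x: int) -> int:
    """Note 2: treat bits 31:0 as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x8000_0000 else x


def type_tag(word: int) -> int:
    """Note 5: bits 37:32, the type tag without the generation number."""
    return (word >> 32) & 0x3F


def branch_taken(cond: str, rs1: int, operand: int) -> bool:
    """Evaluate one condition mnemonic on Rs1 and the selected operand."""
    a_s, b_s = to_signed32(rs1), to_signed32(operand)
    a_u, b_u = rs1 & MASK32, operand & MASK32      # Note 3: unsigned view
    conditions = {
        "ALWAYS": True, "NEVER": False,
        "GE": a_s >= b_s, "GT": a_s > b_s, "LT": a_s < b_s, "LE": a_s <= b_s,
        "EQ": a_s == b_s, "NE": a_s != b_s,
        "UGE": a_u >= b_u, "UGT": a_u > b_u, "ULT": a_u < b_u, "ULE": a_u <= b_u,
        "EQ_TAG": type_tag(rs1) == type_tag(operand),
        "NE_TAG": type_tag(rs1) != type_tag(operand),
        "EQ_38": (rs1 & MASK38) == (operand & MASK38),
        "NE_38": (rs1 & MASK38) != (operand & MASK38),
        "EQ_TC": type_tag(rs1) == (operand & 0x3F),   # operand is Tag_Imm
        "NE_TC": type_tag(rs1) != (operand & 0x3F),
    }
    return conditions[cond]
```

The difference between the signed and unsigned groups shows up with a value such as 0xFFFFFFFF: it compares as less than 1 under LT and as greater than 1 under UGT.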


A.4.  Special  Cases  of  Register-Register  Instructions 

Load, Jump-Register and Return, Read Special Registers, and Write Special Registers instructions can be considered special cases of the Register-Register instructions. The timing of the Load operation is shown in Table A-4-1. It is similar to the Register-Register operation except that the ALU output is sent out as the effective address and the data from memory is written into the destination register. The timing of the Jump-Register and Return operations is shown in Table A-4-2. In this case, the ALU output is sent to the upper datapath and then to the Instruction Unit as the target address. The timing of the Read and Write Special Registers operations is shown in Table A-4-3 and Table A-4-4, respectively. They are similar to the Register-Register operation except that special registers are involved instead of general purpose registers.


Stage/Phase      Actions

Ifet Stage:
  Phase 3        busI <- I-Unit[busPC] ;

Exec Stage:
  Phase 1        busA <- REG_FILE[Rs1], busB <- (not REG_FILE[Rs2]) ;
                 BUSBUFA <- busA,
                 if (busI<14>=0) BUSBUFB <- (not busB)
                 else BUSBUFB <- Sign Extend (busI<13:0>) ;
  Phase 2        busA2 <- BUSBUFA, busB2 <- BUSBUFB ;
                 Port A of ALU <- busA2, Port B of ALU <- busB2 ;
  Phase 4        busS <- ALU ;
                 Address Pads <- busS ;

Mem Stage:
  Phase 1        busPC <- INC ;
                 IfetPC <- busPC, I-Unit <- busPC ;
  Phase 3        busL <- Data Pads ;
                 Dst2 <- busL ;

Wr Stage:
  Phase 3        busA <- Dst1, busB <- (not Dst2) ;
                 REG_FILE[Rd] <- (busA & (not busB)) ;

Table A-4-1 Load Operation


Stage/Phase      Actions

Ifet Stage:
  Phase 3        busI <- I-Unit[busPC] ;

Exec Stage:
  Phase 1        busA <- REG_FILE[Rs1], busB <- (not REG_FILE[Rs2]) ;
                 BUSBUFA <- busA,
                 if (busI<14>=0) BUSBUFB <- (not busB)
                 else BUSBUFB <- Sign Extend (busI<13:0>) ;
  Phase 2        if (opcode = RETURN) Cwp <- Cwp - 1,
                 busA2 <- BUSBUFA, busB2 <- BUSBUFB ;
                 Port A of ALU <- busA2, Port B of ALU <- busB2 ;
  Phase 4        busS <- ALU ;
                 BUSS2PC <- busS ;

Mem Stage:
  Phase 1        busPC <- BUSS2PC ;
                 IfetPC <- busPC, I-Unit <- busPC ;

Wr Stage:
  Phase 3        if (opcode = RETURN) Update Backup Copy of Cwp ;

Table A-4-2 Jump-Register and Return Operations


Table A-4-3 Read Special Registers Operation

Table A-4-4 Write Special Registers Operation


Appendix B

THE  SPUR  CPU  PROBLEMS  REPORT 


The  great  liability  of  the  engineer  compared  to  men  of  other  professions 
is  that  his  works  are  out  in  the  open  where  all  can  see  them. ... 

If  his  works  do  not  work,  he  is  damned. 

Herbert Hoover, 1916


This appendix lists all the known SPUR CPU problems. Unless otherwise specified, all these problems can still be found in the second version of the SPUR CPU. The solutions to these problems are also listed. The SPUR CPU problems can be classified into three groups: (1) microarchitectural problems, (2) electrical problems, and (3) implementation problems.

B.1. Microarchitectural Problems

Microarchitectural problems occur when the CPU chip is doing exactly what the microarchitect designed it to do although it is not doing what the microarchitect wanted it to do. The microarchitect has designed it wrong! These problems can be simulated in behavioral and switch-level simulation. They were not detected during simulation because we did not cover all possible cases or we did not realize they were problems. The SPUR CPU microarchitectural problems are:

(1) The SPUR CPU does not allow two consecutive instructions to modify the same special register (special registers are listed in Appendix A.2). The problem is that the SPUR CPU cannot recover the special register if the second instruction is trapped. The software solution is to avoid writing code that modifies the same special register in two consecutive instructions. The hardware solution is to add one more temporary latch between the special register and its backup copy (see Figure 3-3-1).

(2) The SPUR CPU cannot recover from an interrupt if the second instruction that is being killed is a CALL or RETURN instruction. This problem and its software solution are discussed in Section 3.3.1. The hardware solution to Microarchitectural Problem 1 above also solves this problem.

(3) The SPUR CPU treats the internal instructions just like any other normal instructions. A user can, therefore, use the internal instructions in a program to crash the system. This is a security hole, and we did not have any software solution for it. The hardware solution is to change the SPUR CPU instruction decoder so that it treats all internal instructions as privileged instructions. Whenever a user program attempts to execute an internal instruction, the SPUR CPU should take a mode violation trap.

(4) The CPU and the Cache Controller assume different meanings for the cacheBusy signal. The CPU assumes it will stay asserted during the entire cache operation. The Cache Controller, on the other hand, assumes it can deassert it in the middle of a cache operation as long as the dataValid signal remains deasserted. Consequently, when the Cache Controller deasserts cacheBusy momentarily in the middle of the TEST_AND_SET operation, the SPUR CPU is confused and starts prefetching instructions erroneously if the I-Unit is enabled. We did not have any software solution for this problem. The easiest hardware solution is to put some "glue" logic on the processor board.
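
The cacheBusy mismatch in Problem 4 can be illustrated with a toy signal trace. The Python sketch below is not derived from the SPUR or Cache Controller logic: it assumes a per-cycle trace of (cacheBusy, dataValid) pairs, models the CPU as treating the first cycle with cacheBusy deasserted as the end of the operation, and models the Cache Controller as considering the operation finished only once dataValid is asserted. All names and the trace itself are hypothetical.

```python
from typing import List, Optional, Tuple

Trace = List[Tuple[int, int]]   # one (cacheBusy, dataValid) pair per cycle


def cpu_view_done(trace: Trace) -> Optional[int]:
    """First cycle in which the CPU, as modeled here, sees cacheBusy low."""
    for cycle, (busy, _valid) in enumerate(trace):
        if not busy:
            return cycle
    return None


def controller_view_done(trace: Trace) -> Optional[int]:
    """First cycle in which dataValid is asserted (assumed completion)."""
    for cycle, (_busy, valid) in enumerate(trace):
        if valid:
            return cycle
    return None


# cacheBusy drops momentarily in cycle 2 while dataValid is still low.
glitchy_trace: Trace = [(1, 0), (1, 0), (0, 0), (1, 0), (1, 0), (0, 1)]
```

On `glitchy_trace` the two views disagree: the modeled CPU believes the operation ended at cycle 2, while the modeled Cache Controller finishes at cycle 5. A gap of this kind is what board-level "glue" logic would have to hide from the CPU.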

B.2.  Electrical  Problems 

Electrical problems occur when the CPU chip is doing what neither the microarchitect nor the logic designer designed it to do, because of unexpected electrical problems. These problems cannot be simulated in behavioral or switch-level simulation. Careful and in-depth circuit simulation is the only way to detect these problems. These problems exist because switch-level simulation is not low level enough and it is not practical to run circuit simulation for the entire chip. The SPUR CPU electrical problems are:

(1) A hazardous circuit prevented the CALL instruction from saving the return address properly in the first version of the SPUR CPU. This problem and its solutions are discussed in Section 3.3.2.1.

(2) Misplaced well and substrate contacts resulted in stuck-at-"0" and stuck-at-"1" problems in the first version of the SPUR CPU. This problem and its solutions are discussed in Section 3.3.2.2.

(3) There is a small gap in one of the wires in the instruction unit. We solved this problem by performing micro-surgery on the chip to connect the broken wire using the "Focused Ion Beam IC Development System" available from Seiko Instruments Inc. This problem was introduced when we were fixing problems from the first version.

(4) The SPUR CPU does not ignore interrupts during a global pipeline suspension caused by an external cache miss. This problem is caused by a race condition in the trap logic. The easiest hardware solution is to put some "glue" logic on the processor board.

B.3. Implementation Problems

Implementation problems occur when the CPU chip is doing exactly what the logic or circuit designer designed it to do although it is not doing what the microarchitect wanted it to do. The logic or circuit designer implemented something different from what the microarchitect had in mind! These problems may be detected by comparing the switch-level simulation results against the behavioral-level simulation results if both the switch-level and behavioral-level descriptions have the proper level of detail. These problems exist because of miscommunication between the microarchitect and the logic or circuit designer. The SPUR CPU has only one implementation problem:

(1) The backup copies of all special registers were implemented as dynamic registers. They should be static or pseudo-static registers. This problem and its solution are discussed in Section 3.3.3.


